Sunday, November 18, 2012

R and Outliers. Part 2: What if Everything is Normal?

  • (If you would like the data I used in these examples, please shoot me a note; I'll be happy to email the csv files to you. Better yet, use the same ideas - and maybe even the same code - on your own data.)

I will make them normal!
– Anonymous, in response to the news that some distributions cannot be normal.

What is an outlier?

According to www.dictionary.com, an outlier in its statistical sense is
“… an observation that is well outside of the expected range of values in a study or experiment, and which is often discarded from the data set…”
And here lies one of the most common temptations we face when dealing with real-world data: the dictionary merely describes the practice, but all too often we discard outliers too readily, even though, more often than not, an outlier is “in the eye of the beholder”.

Normal Distribution and Outliers


One of the strange things about data mining is that most of the theorems and axioms start with, “If X is a normally distributed random variable…” In this section, we are going to discuss what happens “if X is INDEED a normally distributed random variable”.

Without going into the details of what is normal and what is not (for definitions, the interested reader is referred to Wikipedia and other, more scientifically rigorous sources), we shall just say here that the normal distribution is a very convenient abstraction for which the mathematics has all been worked out. The often misunderstood Central Limit Theorem ensures that data analysts tend to assume that, with enough data, everything will look normal:

[Figure: histogram of a normally distributed random variable]
Then an outlier is very easy to detect: just look for data that do not fit into the normal distribution.  

The mathematics of outlier detection for normal distributions - or for samples taken from a normal distribution - was worked out long ago and has been used successfully for the last 100 years. All you need to do is compare the data point (or the mean of a group of data points) with the rest of the data; if the probability of seeing a value that far outside the range of the data is less than your desired cutoff p-value (typically 5%), then you have an outlier.
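The rule above can be sketched in a few lines of R. This is a toy illustration with synthetic data (not the article's code); a dedicated outlier test, such as Grubbs' test, refines this idea by accounting for the fact that the suspect point inflates the mean and standard deviation:

```r
## Toy sketch: flag points whose standardized distance from the mean
## exceeds the two-sided critical t value at the 5% level
set.seed(1)
x = c(rnorm(100), 8)                 ## 100 normal points plus one planted outlier
Tcr = qt(0.975, df = length(x) - 1)  ## critical value for a 5% two-sided cutoff
outliers = x[abs(x - mean(x)) / sd(x) > Tcr]
print(outliers)  ## 8 is flagged (a borderline genuine point may be flagged
                 ## too - that is the 5% false-positive rate at work)
```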

The comparison is done using the t-test, where the T value is calculated as

T = (X − µ) / (S / √N)

where X is the data point of interest; µ is the mean (average) of the sample; S is its standard deviation; and N is the sample size. From the t-distribution table one can obtain the critical value of T that corresponds to the desired confidence level and the number of observations in the sample, N.

In R, there is a function that will do all the math for us: t.test(). However, math works in both directions, and the definition of T is no exception: it also gives us an important way to compute the minimum sample size one needs in order to be able to tell the difference between means at a given confidence level.
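Base R's power.t.test() (in the stats package) can solve for that minimum sample size directly; the difference, standard deviation, and power below are illustrative values, not taken from the article's data:

```r
## Minimum per-group sample size to detect a 0.1 difference in means
## when the standard deviation is 0.25, at the 5% significance level
## with 80% power (all numbers are illustrative)
pwr = power.t.test(delta = 0.1, sd = 0.25, sig.level = 0.05, power = 0.8)
ceiling(pwr$n)  ## about 100 observations per group
```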

In R, the qt() function returns the desired critical value Tcr.
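For instance, to obtain the critical value for a two-sided test at the 5% level with N = 25 observations:

```r
N = 25
Tcr = qt(0.975, df = N - 1)  ## upper 2.5% quantile of the t distribution, 24 df
print(Tcr)                   ## approximately 2.064
```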

But back to t.test():

If we have two samples drawn from normal distributions (the samples themselves need not look normal), then we can answer the question of whether they have the same mean by executing the t.test() function:
 
set.seed(1234567) ## to have the same vectors for X, Y, and Z every time
X = rnorm(1000, 1, 0.25)
hist(X, cex = 0.75, main = "Histogram of a Normally distributed random variable")
Y = rnorm(25, 1, 0.25)
t.test(X, Y)
Z = rnorm(25, 15, 0.25)
t.test(Z, X)

This program will produce (in addition to the histogram above) the following output:
> t.test (X, Y)
Welch Two Sample t-test
data: X and Y
t = 0.0534, df = 25.419, p-value = 0.9578
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.09190418 0.09680157
sample estimates:
mean of x mean of y
0.9955788 0.9931301
> t.test(Z, X)
Welch Two Sample t-test
data: Z and X
t = 294.1182, df = 25.309, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
13.92130 14.11752
sample estimates:
mean of x mean of y
15.0149879 0.9955788

R output is typically defined by the package authors, and it is geared towards a human-readable form.

How can we programmatically extract useful information from this?

Fortunately, the object returned by the t.test() function is a list, which we know how to address:
XY = t.test(X, Y)
names(XY) will give us all the elements of this list:
> names(XY)
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "alternative" "method" "data.name"
And we can write:
> if (XY$p.value > 0.05){print ("The X and Y are the same")}
[1] "The X and Y are the same"
>

For data sets containing multiple vectors or columns of data, it is inconvenient to write the same line over and over again: if you have to do it more than once, write a program. In general, in data mining, we want to reuse as much code as we possibly can.  

The best way to do it is by using a function.

It is generally good practice to apply the same principles of abstraction used in higher-level programming languages like C, C++, C#, and Java. We recommend defining all functions in an R file separate from the one that uses them.

We will define the function AreTheySame() in a file "foo.r":

AreTheySame = function (x, y, pValue = 0.05, Print_TF = FALSE){
    XY = t.test(x, y)
    if (Print_TF) {
        if (XY$p.value > pValue) {
            print ("They are the same")
        } else {
            print ("They are different")
        }
    }
    return (XY$p.value > pValue)
}

(All functions in R return a value, which is typically the value of the last expression evaluated, but R also supports an explicit return() call. For code readability, it is better to force a return explicitly.)
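A quick illustration of the two styles (the function names here are made up for the example):

```r
AddOneImplicit = function (x) { x + 1 }           ## returns the value of the last expression
AddOneExplicit = function (x) { return (x + 1) }  ## forces the return explicitly
print(AddOneImplicit(1) == AddOneExplicit(1))     ## TRUE: the two are equivalent
```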

Now all we need to do is tell R that it needs to execute the code saved in file foo.r:
source ("foo.r")
print (AreTheySame (x = X, y = Y, pValue = 0.05, Print_TF = TRUE))
[1] "They are the same"
[1] TRUE
>

Practical Application 1: T-test
After an upgrade and a split of a large computer server pool into several smaller pools, we want to see whether the CPU utilization on each of them has changed compared to the original single large pool.
We have collected the data into a csv file: the first column is just a row number, the second column holds the original pool's CPU utilization, and the remaining columns hold the new pools. The data are all collected under the same demand conditions, so it is appropriate to execute a t-test for each of the new pools, since we are not looking for pool-to-pool uniformity (that would be the F-test, or ANOVA).
Here is the R code:

Data = read.csv("CPU.csv")
head(Data)        ## It is always good practice to familiarize oneself with the layout of the data
source ("foo.r")  ## Always make sure all functions are defined before they are called
Names = names(Data)  ## The first column is just the row number; we shall ignore it
CPUDidNotChange = sapply (3:length(Names), function (x, cpuData){
    orig = cpuData[, 2]  ## 2nd column is the original CPU utilization
    pool = cpuData[, x]  ## x is the argument passed to the anonymous function defined here
    return (AreTheySame(x = orig, y = pool, pValue = 0.05, Print_TF = FALSE))
}, cpuData = Data
)
Notice that the x passed to the anonymous function inside sapply() is not the same as the x passed into the AreTheySame() function. Also notice the third argument passed into sapply(); it becomes the second argument of the anonymous function.
There is one thing here that shows bad R programming practice: the orig and pool variables are not removed from the namespace, and no garbage collection is executed. Also, having a big anonymous function inside sapply() makes the code very hard to read and debug.

We are going to move this function into our foo.r file and make sure all garbage is collected properly.
CPUDidNotChange = function (colNum, cpuData){
    ## For debugging, it is good to give some values to the parameters
    ## being passed and comment out those lines. This way, when debugging,
    ## we can uncomment these lines and run the function step by step.
    # colNum = 3
    # cpuData = data.frame (X = 1:10000,
    #                       Orig = rnorm(10000, 0.75, 0.15),
    #                       Pool1 = rnorm(10000, 0.76, 0.15))
    #######################################################################
    orig = cpuData[, 2]       ## 2nd column is the original CPU utilization
    pool = cpuData[, colNum]  ## colNum is the first argument, supplied by sapply()
    DidNotChange = AreTheySame(x = orig, y = pool,
                               pValue = 0.05, Print_TF = FALSE)
    rm (orig, pool); gc()
    DidNotChange  ## An alternative way of returning a value: the last expression
}
And then the call to this function from outside becomes a one-liner:
AreTheyAsOriginal = sapply (3:length(Names), FUN = CPUDidNotChange, cpuData = Data)
Checking:
> print (AreTheyAsOriginal)
[1] TRUE FALSE FALSE FALSE
We will let the reader figure out how best to combine this result with the Names variable which corresponds to the column names in the CSV file.
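One possible approach, sketched here with hypothetical pool names and stand-in values (the actual CSV layout may differ):

```r
## Hypothetical stand-ins for the Names and AreTheyAsOriginal variables
Names = c("X", "Orig", "Pool1", "Pool2", "Pool3", "Pool4")
AreTheyAsOriginal = c(TRUE, FALSE, FALSE, FALSE)
## Attach the pool column names to the corresponding test results
Results = setNames(AreTheyAsOriginal, Names[3:length(Names)])
print(Results)
## Pool1 Pool2 Pool3 Pool4
##  TRUE FALSE FALSE FALSE
```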
The CSV file (CPU.csv) is available upon request.
Practical Application 2: Load Uniformity analysis
A very important metric in IT, in chemical and even nuclear engineering is the load uniformity across multiple resources (servers in a pool, pumps in a pumping station, etc). Uniform load ensures optimal utilization of resources and is therefore the desirable way of running parallel load servers. The best way to verify uniformity is by using Analysis of Variance, or ANOVA (F-test).

We will not go into the mathematical details of ANOVA here; there are numerous better descriptions of the method online, e.g., http://www.itl.nist.gov/div898/handbook/eda/eda.htm, but we will demonstrate its use with the CPU.csv file described above. We shall determine whether the 4 new pools are running at a uniform CPU utilization.
The R function for ANOVA is aov(). The aov() function, however, requires special data preparation: in our case, the data must be in two columns, one holding the CPU utilization and the other the pool name. We will prepare the data for that in the now-familiar foo.r file. (I am intentionally not using the "reshape" package in this example: the goal is to use "minimalist" R.)

PrepareDataForANOVA = function (colNums, cpuData){
    Names = names(cpuData)
    rDFs = lapply (colNums,
        function (x, CPUData){
            return (data.frame (CPU = CPUData[, x], Pool = Names[x]))
        }, CPUData = cpuData
    )
    retDF = do.call('rbind', rDFs)
    rm(rDFs); gc()
    return (retDF)
}

Here we see a relative of our old friend, the sapply() function. The lapply() function is different in that it returns an object of type list. We will take advantage of that by using do.call(), which calls a function with the elements of a list as its arguments.
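A minimal illustration of the lapply()/do.call() combination on synthetic data:

```r
## Build a small list of data frames, then row-bind them in one call
dfs = lapply (1:3, function (i){
    data.frame (CPU = runif(2), Pool = paste0("Pool", i))
})
combined = do.call('rbind', dfs)
nrow(combined)  ## 6 rows: 2 from each of the 3 data frames
```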

So lapply() returns a list whose elements (rDFs) are data frames with the CPU utilization in one column and the pool name in the other. When we combine them using do.call('rbind', rDFs), R will, very memory- and runtime-efficiently, combine all the rDFs into one retDF.

After that, rDFs are of no more use to us; so we remove them from memory and call the garbage collector to return the memory to the OS.

That done, we can build a function that will be calling it and processing the data to return the yes or no result: is there uniformity in the operation of these pools, based on their CPU consumption?

IsCPUUniform = function (colNums = 3:6, cpuData){
    source ("foo.r")
    OneDF = PrepareDataForANOVA(colNums, cpuData)
    cpuANOVA = aov(CPU ~ Pool, data = OneDF)
    smry = summary(cpuANOVA)
}
We need to load the latest version of all the functions, even if it is in the same file.
Then we prepare data for ANOVA using the just-created function.
And then we apply the aov() function to the result of the preparation function.
However, if at this point we say inside the function,
print(cpuANOVA)
R will say:
Call:
aov(formula = CPU ~ Pool, data = OneDF)
Terms:
Pool Residuals
Sum of Squares 708.6629 906.9735
Deg. of Freedom 3 39996
Residual standard error: 0.1505875
Estimated effects may be unbalanced
> names(cpuANOVA)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "contrasts" "xlevels" "call" "terms"
[13] "model"
>
Nothing here tells us anything about uniformity. Then it’s R to the rescue: the summary() function:
> smry = summary(cpuANOVA)
> smry
Df Sum Sq Mean Sq F value Pr(>F)
Pool 3 708.66 236.221 10417 < 2.2e-16 ***
Residuals 39996 906.97 0.023
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
Notice how R marks factors that have a significant effect with asterisks. In our case, it means that there is no uniformity among the pools: the Pool factor is highly significant, accounting for about 44% of the overall variance (708.66 of the 1615.64 total sum of squares).

However, the summary of an aov object is a strange animal. It is a list with only one element, and that element is a data frame; so the proper way of extracting useful information from it is smry[[1]]. To make things easier, we shall rename the last column.
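A self-contained example of the extraction (synthetic data, for illustration only; the column "Pr(>F)" is what we will rename to P_value):

```r
set.seed(42)
## Two toy pools with deliberately different mean CPU utilization
toy = data.frame (CPU = c(rnorm(50, 0.75, 0.10), rnorm(50, 0.90, 0.10)),
                  Pool = rep(c("Pool1", "Pool2"), each = 50))
smry = summary(aov(CPU ~ Pool, data = toy))
pVal = smry[[1]][["Pr(>F)"]][1]  ## first row of the data frame holds the Pool factor
print(pVal < 0.05)               ## TRUE: the two pools differ significantly
```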

Final look of the function:

IsCPUUniform = function (colNums = 3:6, cpuData){
    source ("foo.r")
    OneDF = PrepareDataForANOVA(colNums, cpuData)
    cpuANOVA = aov(CPU ~ Pool, data = OneDF)
    smry = summary(cpuANOVA)
    Names = names(smry[[1]])
    names(smry[[1]])[length(Names)] = "P_value"
    rm(cpuANOVA, OneDF); gc()
    return (as.data.frame (smry[[1]])$P_value[1] > 0.05)
}

The return line deserves a little more explanation. After the assignment of the new name for P value,
> as.data.frame (smry[[1]])
Df Sum Sq Mean Sq F value P_value
Pool 3 708.6629 236.22097328 10416.95 0
Residuals 39996 906.9735 0.02267660 NA NA
>

R will first evaluate smry[[1]], then coerce it to the data.frame class, and finally take the first element of its P_value column and compare it with 0.05. A higher F value (and a lower p-value) corresponds to less uniform data; so we return TRUE if the load is uniform and FALSE if it is not.

 

Conclusion:

 
The t-test, ANOVA (F-test), and many other statistical tests rely on the assumption that the underlying distributions are normal. While this is not always the case, the power and convenience of these tests, as well as the simplicity of their implementation in R, very often make it worth the analyst’s while to spend time manipulating the data in order to bring the underlying distributions into a normal or quasi-normal form.
  • (From now on, nothing is going to be normal.)
(To be continued...)
