- (If you need the data I used in these examples, please shoot me a note. I'll be happy to email the csv files to you, but it is better if you can use the same ideas - and maybe even the same code - on your own data.)
I will make them normal!
– Anonymous, in response to the news that some distributions cannot be normal.
What is an outlier?
According to www.dictionary.com, an outlier in its statistical sense is
“… an observation that is well outside of the expected range of values in a study or experiment, and which is often discarded from the data set …”
And here lies one of the most common temptations we face when dealing with real-world data: the dictionary merely describes the practice, yet all too often we rush to discard outliers, even though, more often than not, they are “in the eye of the beholder”.
Normal Distribution and Outliers
One of the strange things about data mining is that most of the theorems and axioms start with, “If X is a normally distributed random variable…” In this section, we are going to discuss what happens if X is indeed a normally distributed random variable.
Without going into the details of what is normal and what is not (for definitions, the interested reader is referred to Wikipedia and other, more scientifically rigorous sources), we shall just say here that the normal distribution is a very convenient abstraction for which the mathematics is all worked out, and the often misunderstood Central Limit Theorem ensures that data analysts tend to assume that, with enough data, everything will look normal.
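A quick sketch of that intuition (a synthetic example; the seed and sample sizes are arbitrary): the means of many samples drawn from a decidedly non-normal uniform distribution are themselves approximately normal.

```r
set.seed(42)  ## arbitrary seed, for reproducibility
## 10,000 sample means, each computed from 50 uniform (non-normal) draws
sampleMeans = replicate(10000, mean(runif(50)))
hist(sampleMeans, main = "Means of uniform samples look normal")
## By the CLT the means cluster around 0.5 with standard deviation
## close to sqrt(1/12) / sqrt(50), i.e. about 0.041
print(mean(sampleMeans))
print(sd(sampleMeans))
```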
Then an outlier is very easy to detect: just look for data that do not fit the normal distribution. The mathematics of outlier detection for normal distributions, or for samples taken from a normal distribution, has all been worked out and has been used successfully for the last 100 years. All you need to do is compare the number (or the mean of a group of numbers) with the rest of the data, and if the probability of its lying that far outside the range of the data is less than your desired cutoff “p-value” (typically 5%), then you have an outlier.
The comparison is done using the t-test, where the t value is calculated as

t = (X – µ) / (S / √N)

where X = the data point (or sample mean) of interest; µ = mean (average) of the sample; S = its standard deviation; and N = sample size. From the t-distribution table one can obtain the critical value Tcr that corresponds to the desired confidence level and the number of observations in the sample, N.
In R, there is a function that will do all the math for us; it is called t.test(). However, math works in both directions, and the definition of t is no exception: it also gives us a way to compute the minimum sample size needed to tell the difference ∆ between means at a given confidence level, N ≥ (Tcr · S / ∆)². In R, the qt() function will return the desired value for Tcr.
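A minimal sketch of that computation (the standard deviation S and detectable difference delta below are assumed values, purely for illustration):

```r
## Critical t value (Tcr) for a 95% two-sided confidence level with
## N - 1 = 24 degrees of freedom, from the t-distribution quantile function
Tcr = qt(0.975, df = 24)
print(Tcr)  ## about 2.064

## Rearranging t = (X - mu) / (S / sqrt(N)) gives the minimum sample size
## needed to detect a difference delta between means: N >= (Tcr * S / delta)^2
S = 0.25     ## assumed standard deviation of the data
delta = 0.1  ## assumed smallest difference in means we care about
Nmin = ceiling((Tcr * S / delta)^2)
print(Nmin)  ## 27
```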
But back to t.test(): if we have two samples taken from normal distributions (and such samples do not themselves have to look normal), then we can answer the question of whether they have the same mean by executing the t.test() function:
set.seed(1234567) ## to have the same vectors for X, Y, and Z every time
X = rnorm(1000, 1, 0.25)
hist(X, cex = 0.75, main = "Histogram of a Normally distributed random variable")
Y = rnorm (25, 1, 0.25)
t.test (X, Y)
Z = rnorm (25, 15, 0.25)
t.test(Z, X)
The output of this program (in addition to the histogram above) will be:
> t.test (X, Y)

        Welch Two Sample t-test

data:  X and Y
t = 0.0534, df = 25.419, p-value = 0.9578
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.09190418  0.09680157
sample estimates:
mean of x mean of y
0.9955788 0.9931301

> t.test(Z, X)

        Welch Two Sample t-test

data:  Z and X
t = 294.1182, df = 25.309, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 13.92130 14.11752
sample estimates:
 mean of x  mean of y
15.0149879  0.9955788
R output is typically defined by the package authors and geared toward a human-readable form. How can we programmatically extract useful information from it? Fortunately, the object returned by the t.test() function is a list, which we know how to address:
XY = t.test(X, Y)

names(XY) will give us all the elements of this list:

> names(XY)
[1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"
[6] "null.value"  "alternative" "method"      "data.name"
And we can write:

> if (XY$p.value > 0.05){print ("The X and Y are the same")}
[1] "The X and Y are the same"
>
For data sets containing multiple vectors or columns of data, it is inconvenient to write the same line over and over again: if you have to do something more than once, write a program. In general, in data mining, we want to reuse as much code as we possibly can, and the best way to do that is with a function. It is good practice to apply the principles of abstraction used in higher-level programming languages, like C, C++, C#, and Java, and we recommend defining all the functions in an R file separate from the one that uses them.
We will define the function AreTheySame() in a file "foo.r":
AreTheySame = function (x, y, pValue = 0.05, Print_TF = FALSE){
    XY = t.test(x, y)
    if (Print_TF) {    ## note: "if" must be lowercase in R
        if (XY$p.value > pValue) {
            print ("They are the same")
        } else {
            print ("They are different")
        }
    }
    return (XY$p.value > pValue)
}
(All functions in R return a value, which is normally the value of the last expression evaluated, but R also supports an explicit return() call. For code readability, it is better to force the return explicitly.)
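A minimal illustration with two throwaway functions (names made up for this example):

```r
## Implicit return: the value of the last evaluated expression is returned
Add1 = function (x) { x + 1 }
## Explicit return: identical behavior, but the intent is visible at a glance
Add1Explicit = function (x) { return (x + 1) }
print(Add1(41))          ## 42
print(Add1Explicit(41))  ## 42
```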
Now all we need to do is tell R to execute the code saved in the file foo.r:

source ("foo.r")
print (AreTheySame (x = X, y = Y, pValue = 0.05, Print_TF = TRUE))
[1] "They are the same"
[1] TRUE
>
Practical Application 1: T-test
After an upgrade and a split of a large computer server pool into several smaller pools, we want to see whether the CPU utilization on each of them has changed compared to the original large pool. We have collected the data into a csv file, with columns corresponding to the pools: the first column is the original pool and the rest are the new pools. The data were all collected under the same demand conditions, so it is appropriate to execute a t-test for each of the new pools, since we are not looking for pool-to-pool uniformity (that would be the F-test, or ANOVA).
Here is the R code:
Data = read.csv("CPU.csv")
head(Data) # It’s always good practice to familiarize oneself with the layout of the data
source ("foo.r") ## Always make sure all functions are defined before they are called
Names = names(Data)
### The first column is just the row number; we shall ignore it;
### the second is the original CPU utilization data
CPUDidNotChange = sapply (3:length(Names), function (x, cpuData){
        orig = cpuData[,2] ## 2nd column is the original CPU utilization
        pool = cpuData[,x] ## x is the argument passed to the function defined here
        return (AreTheySame(x = orig, y = pool, pValue = 0.05, Print_TF = FALSE))
    }, cpuData = Data
)
Notice that the x passed to the anonymous function inside the sapply() is not the same as the x passed into the AreTheySame() function. Also notice that there is a third argument passed into the sapply() function; that is the second argument that the anonymous function uses.
There is one point that shows bad R programming practice: the orig and pool variables are not removed from the namespace, and no garbage collection is executed. Also, a big anonymous function inside sapply() makes the code very hard to read and debug. We are going to move this function into our foo.r file and make sure all garbage is collected properly.
CPUDidNotChange = function (colNum, cpuData){
    ## For debugging, it is good to give some values to the
    ## parameters being passed and comment out those lines.
    ## This way, when debugging, we can uncomment these lines
    ## and run the function step by step
    # colNum = 3
    # cpuData = data.frame (X = 1:10000,
    #                       Orig = rnorm(10000, 0.75, 0.15),
    #                       Pool1 = rnorm(10000, 0.76, 0.15)
    #                       )
    ###################################################################
    orig = cpuData[,2]      ## 2nd column is the original CPU utilization
    pool = cpuData[,colNum] ## colNum is the first argument passed by the sapply function
    DidNotChange = AreTheySame(x = orig, y = pool, pValue = 0.05, Print_TF = FALSE)
    rm (orig, pool); gc()
    DidNotChange ## An alternative way of forcing a return of the value
}
And then the call to this function from outside becomes a one-liner:

AreTheyAsOriginal = sapply (3:length(Names), FUN = CPUDidNotChange, cpuData = Data)

Checking:

> print (AreTheyAsOriginal)
[1] TRUE FALSE FALSE FALSE
We will let the reader figure out how best to combine this result with the Names variable, which corresponds to the column names in the CSV file.
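One possible way (a sketch with made-up stand-ins for the Names and AreTheyAsOriginal variables, since CPU.csv itself is not reproduced here) is to attach the pool names to the logical vector:

```r
## Hypothetical stand-ins mirroring the variables defined above
Names = c("X", "Orig", "Pool1", "Pool2", "Pool3", "Pool4")
AreTheyAsOriginal = c(TRUE, FALSE, FALSE, FALSE)
## Pool columns start at column 3, so their names line up with the results
names(AreTheyAsOriginal) = Names[3:length(Names)]
print(AreTheyAsOriginal)
```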
The CSV file (CPU.csv) is available upon request.
Practical Application 2: Load Uniformity Analysis
A very important metric in IT, in chemical, and even in nuclear engineering is the load uniformity across multiple resources (servers in a pool, pumps in a pumping station, etc.). Uniform load ensures optimal utilization of resources and is therefore the desirable way of running parallel load servers. The best way to verify uniformity is by using Analysis of Variance, or ANOVA (the F-test).
We will not go into the mathematical details of ANOVA here; there are numerous good descriptions of the method online, e.g., http://www.itl.nist.gov/div898/handbook/eda/eda.htm, but we will demonstrate its use with the CPU.csv file described above. We shall determine whether the 4 new pools are running at a uniform CPU utilization.
The R function for ANOVA is aov(). The aov() function, however, requires special data preparation: in our case, the data have to be in two columns, the CPU utilization and the pool name. We will prepare the data in the now familiar foo.r file. (I am intentionally not using the "reshape" package in this example: the goal is to use "minimalist" R.)
PrepareDataForANOVA = function (colNums, cpuData){
    Names = names(cpuData)
    rDFs = lapply (colNums, function (x, CPUData){
            return (data.frame (CPU = CPUData[,x], Pool = Names[x]))
        }, CPUData = cpuData
    )
    retDF = do.call('rbind', rDFs)
    rm(rDFs); gc()
    return (retDF)
}
Here we see a relative of our old friend, the sapply() function. The lapply() function differs in that it returns an object of type list. We will take advantage of that by using R's ability to call a single function with a list of arguments: the do.call() function. So lapply() will return a list whose elements (rDFs) are data frames with the CPU utilization in one column and the pool name in the other. We then combine them using do.call('rbind', rDFs), which will, very memory- and runtime-efficiently, combine all rDFs into one retDF. After that, the rDFs are of no more use to us, so we remove them from memory and call the garbage collector to return the memory to the OS.
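On a small synthetic data frame (made-up numbers; the real CPU.csv is much larger) the reshaping step looks like this:

```r
## Toy stand-in for the CPU data: two pools, three observations each
cpuData = data.frame(X = 1:3,
                     Orig  = c(0.70, 0.80, 0.75),
                     Pool1 = c(0.72, 0.74, 0.71),
                     Pool2 = c(0.90, 0.95, 0.93))
Names = names(cpuData)
## One small data frame per pool column...
rDFs = lapply(3:4, function (x) data.frame(CPU = cpuData[, x], Pool = Names[x]))
## ...stacked into a single two-column data frame, ready for aov()
OneDF = do.call('rbind', rDFs)
print(OneDF)  ## 6 rows, columns CPU and Pool
```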
That done, we can build a function that calls it and processes the data to return the yes-or-no result: is there uniformity in the operation of these pools, based on their CPU consumption?
IsCPUUniform = function (colNums = 3:6, cpuData){
    source ("foo.r")
    OneDF = PrepareDataForANOVA(colNums, cpuData)
    cpuANOVA = aov(CPU ~ Pool, data = OneDF)
    smry = summary(cpuANOVA)
}
We need to load the latest version of all the functions, even if the calling code is in the same file. Then we prepare the data for ANOVA using the just-created function, and then we apply the aov() function to the result of the preparation function.
However, if at this point we add print(cpuANOVA) inside the function, R will say:
Call:
   aov(formula = CPU ~ Pool, data = OneDF)

Terms:
                     Pool Residuals
Sum of Squares   708.6629  906.9735
Deg. of Freedom         3     39996

Residual standard error: 0.1505875
Estimated effects may be unbalanced

> names(cpuANOVA)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "contrasts"     "xlevels"       "call"          "terms"
[13] "model"
>
Nothing here that will tell us what the uniformity is. Then it’s R to the rescue: the summary() function:
> smry = summary(cpuANOVA)
> smry
               Df Sum Sq Mean Sq F value    Pr(>F)
Pool            3 708.66 236.221   10417 < 2.2e-16 ***
Residuals   39996 906.97   0.023
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
Notice how R marks factors that have a significant effect with asterisks. In our case, it means that there is no uniformity among the pools: the Pool factor is highly significant, accounting for roughly 44% of the overall variance (708.66 out of the 1615.64 total sum of squares). However, the summary of an aov object is a strange animal. It is a list with only one element, which is a data frame; so the proper way of extracting the useful information from it is smry[[1]]. To make things easier, we shall rename the last column.
Final look of the function:
IsCPUUniform = function (colNums = 3:6, cpuData){
    source ("foo.r")
    OneDF = PrepareDataForANOVA(colNums, cpuData)
    cpuANOVA = aov(CPU ~ Pool, data = OneDF)
    smry = summary(cpuANOVA)
    Names = names(smry[[1]])
    names(smry[[1]])[length(Names)] = "P_value"
    rm(cpuANOVA, OneDF); gc()
    return (as.data.frame (smry[[1]])$P_value[1] > 0.05)
}
The return line deserves a little more explanation. After the assignment of the new name for the P value:

> as.data.frame (smry[[1]])
             Df   Sum Sq      Mean Sq  F value P_value
Pool          3 708.6629 236.22097328 10416.95       0
Residuals 39996 906.9735   0.02267660       NA      NA
>
R will first evaluate smry[[1]], then cast (coerce) it to the data.frame class, and finally take its P_value column and compare it with 0.05. A higher F value (and a lower P value) corresponds to less uniform data; the function returns TRUE if the load is uniform and FALSE if it is not.
Conclusion:
The t-test, ANOVA (F-test), and many other statistical tests rely on the assumption that the underlying distributions are normal. While that is not always the case, the power and convenience of these tests, as well as the simplicity of their implementation in R, very often make it worth the analyst's while to spend the time manipulating the data to bring the underlying distributions into a normal or quasi-normal form.
- (From now on, nothing is going to be normal.)