Sunday, November 18, 2012

R and Outliers. Part 3: Out of the (George) Box

The Inter-Quartile Range

In this section, we shall tackle the most important question with respect to outliers: what happens if X is NOT a normally distributed random variable?

While many real-world distributions can be approximated by the normal distribution, others violate the assumptions of the formal mathematical tests badly enough to mislead the unsuspecting analyst into wrong conclusions. The most prominent examples are response times and interarrival times.

Already in the early 1900s, the Danish telecommunications engineer A.K. Erlang determined that telephone calls arrive in a Poisson process, and that they tend to be served in exponentially distributed times. That allowed him and later engineers and researchers to build a family of mathematical models of traffic systems with queuing.

The approach abandons utilization-based (normal-distribution-based) sizing and instead takes advantage of the actual distributions: systems are sized by the probability that all resources are occupied when the next request arrives. That probability follows from Erlang's study of arrival rates and service times.
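For a flavor of these models, the classic Erlang B formula (the probability that all N servers are busy when a request arrives, given an offered load of A Erlangs) can be computed in R with the standard recurrence. This is just a minimal sketch for illustration:

ErlangB = function(A, N) {
  # Standard recurrence: B(0) = 1; B(n) = A * B(n-1) / (n + A * B(n-1))
  B = 1
  for (n in 1:N) B = A * B / (n + A * B)
  B
}
ErlangB(A = 10, N = 15)   # blocking probability for 10 Erlangs offered to 15 servers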

(Today these and derived models work for sizing and predicting overload probabilities in telephony, transportation, fast-food chains, and - lately - more and more in IT; I've seen some research on applying the same principles to supply-demand chain modeling - and there is no reason not to: it's all derived from queuing theory.)

Naturally, for such systems, it becomes important to find outliers that do not belong in the actual distribution, not in some hypothetically normal one.

In the 1950s, George Box started developing the stochastic theory of process control, which grew into SPC (statistical process control) – all the way to its more recent incarnations, including Six Sigma and SEDS (more on SEDS and other work by Dr. Igor Trubin in this direction is available in the CMG Proceedings; just search his name on the CMG web site or on LinkedIn).

At the same time, Box was working on laying the foundations of the theory of automatic control of large systems (chemical plants, nuclear reactors, food manufacturing facilities).

As part of this work, Box had to be able to identify outliers – data points where something had changed, something that was no longer “as before”. The answer came from John Tukey, and his quantile-analysis approach served Box – and has since served us – well.

The idea was simple: instead of the mean and standard deviation, use the median and percentiles to quantify the distribution. With the parametric approach, we have to deal with the mean, the standard deviation, the skewness (a measure of asymmetry), and the kurtosis (a measure of tail heaviness). In a similar manner, we should be able to quantify any distribution based on the median (50th percentile), the first and third quartiles (25th and 75th percentiles), and a multiplier for the inter-quartile range (the distance between the third and first quartiles) that determines where the tails of the distribution begin.

That approach was very promising: it made the heavy calculations of the third and fourth central moments of the distribution obsolete, and it opened an easy way for any distribution – not necessarily normal – to be quantified, and therefore described and controlled, mathematically.
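For reference, those higher moments are easy enough to compute in R for any numeric vector x – a small sketch, using the simple (slightly biased) estimators:

skewness = function(x) mean((x - mean(x))^3) / sd(x)^3   # third standardized moment
kurtosis = function(x) mean((x - mean(x))^4) / sd(x)^4   # fourth standardized moment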

These ideas were so appealing and important to the creators of R's precursors, S and S-PLUS, and later of R itself, that they implemented these techniques in the base packages: whenever you call the summary() function on a numeric vector, it returns the min, the max, the mean, and the three quartiles – Q1 (25th percentile), Q2 (the median), and Q3 (75th percentile):

> X = rnorm(1000, 1, 0.25)
> print (summary(X))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1285 0.8413 0.9984 0.9993 1.1780 1.7360
The summary() function returns a named vector that can be indexed like a list, which makes it easy enough to access the individual values.
R also allows us to compute the inter-quartile range, IQR:

> IQR(X)
[1] 0.336351
Base R graphics also allows us to visualize all that information, and more: the function boxplot(X) draws a box plot, also known as a box-and-whisker plot:

[Figure: box plot of the simulated vector X]

The thick line in the middle corresponds to the median, and the box itself is sandwiched between Q1 and Q3 of the data.
The whiskers (the dotted lines ending in horizontal tick marks) extend up to 1.5 * IQR(X) beyond the box, and the circular dots beyond the whiskers are the outliers.
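Incidentally, base R will hand us those points directly: boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses for its whiskers:

bs = boxplot.stats(X)   # coef = 1.5 by default, matching the plot
bs$out                  # the points drawn beyond the whiskers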
Parenthetically, we can also put the mean and the three-sigma limits on this plot, if we want to:
> abline (h = mean(X) + 3 * sd(X), col = "red")
> points(mean(X), col = "red", pch = 2)
> abline (h = mean(X) - 3 * sd(X), col = "red")
> legend (x = "bottomright", col =c("black", "red"),
+ lty = c(1, 1), legend = c("quartiles", "mean & sd"))
>

Interestingly, in our case, the upper 3-sigma limit would have ignored the upper outliers:

[Figure: box plot of X with the mean (red triangle) and the 3-sigma limits (red lines) overlaid]

Now that we have it all described, outlier detection becomes a very simple exercise.
The basic outlier detection function (we can add it to the already familiar foo.r file) will look something like this:

GetOutliers = function(X) {
  smry = summary(X)
  iqr = IQR(X)
  upper = smry[[5]] + 1.5 * iqr   # Q3 (3rd Qu.) + 1.5 * IQR
  lower = smry[[2]] - 1.5 * iqr   # Q1 (1st Qu.) - 1.5 * IQR
  UpperOutliers = X[which(X > upper)]
  LowerOutliers = X[which(X < lower)]
  Outliers = c(LowerOutliers, UpperOutliers)
  Outliers
}
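A quick sanity check on the simulated vector X from above:

source("foo.r")
GetOutliers(X)   # values beyond Q1 - 1.5 * IQR or Q3 + 1.5 * IQR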
However, this function returns both high and low outliers – which may or may not be what we are after.

Another shortcoming of this function – and bad programming style – is that it returns only the values of the outliers, not their positions in the vector X. If X is just one column of a data frame, and we want to do something with all rows where X is an outlier, it is better to make the function return the outlier indices.

To keep things flexible, let us keep this function and create another one that returns the indices of the outliers, letting the user decide which outliers they are looking for.

GetOutlierIndices = function(X, Upper = TRUE, Lower = TRUE) {
  # Upper = TRUE; Lower = TRUE ### for debugging
  smry = summary(X)
  iqr = IQR(X)
  upper = smry[[5]] + 1.5 * iqr   # Q3 + 1.5 * IQR
  lower = smry[[2]] - 1.5 * iqr   # Q1 - 1.5 * IQR
  # Upper and Lower are logical scalars; recycled against the comparisons,
  # they switch each side of the detection on or off.
  UpperOLindices = which(Upper & X > upper)
  LowerOLindices = which(Lower & X < lower)
  Outliers = c(LowerOLindices, UpperOLindices)
  Outliers
}

We set the default behavior to return the indices of all outliers. The UpperOLindices and LowerOLindices lines are simply a compact way of saying:

if (Upper) {
  UpperOLindices = which(X > upper)
} else {
  UpperOLindices = c()
}
The same goes for LowerOLindices.
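For example, if X is the column of interest in some data frame DF (a hypothetical name here), the indices let us grab the entire offending rows:

idx = GetOutlierIndices(DF$X, Upper = TRUE, Lower = FALSE)   # high outliers only
SuspectRows = DF[idx, ]   # all columns of the rows where X is a high outlier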

You are welcome to modify and enhance these and any other functions; these are just for illustration of the ideas.

Anecdote
When John Tukey was asked why the whiskers should be at 1.5 * IQR, he said: “1.0 is too small, and 2.0 is too big; it should really be case by case, but let's keep it at 1.5 until proven otherwise.”

IQR and Distribution Tails

Let us look at what happens with a non-normal distribution. We shall pick the exponential: as we discussed earlier in this section, time intervals – response times, as well as times before failure in reliability theory – are usually distributed exponentially (except over large averaging intervals, when the Central Limit Theorem forces them to hide their true nature). The IQR methodology does not rely on the distribution law of the data, and we shall demonstrate how to use it in the next Practical Application.
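Before the case study, a quick simulated illustration of the point (the rate parameter below is an arbitrary choice): on exponential data, the 3-sigma rule flags far fewer points of the long upper tail than the IQR rule does.

set.seed(42)
Y = rexp(1000, rate = 1/50)   # simulated "response times", mean 50
sum(Y > mean(Y) + 3 * sd(Y))               # tail points flagged by the 3-sigma rule
sum(Y > quantile(Y, 0.75) + 1.5 * IQR(Y))  # tail points flagged by the IQR rule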

At some point in my life, I worked in the chemical industry, and we had to ensure that a certain chemical stayed dry for a given period of time (a year). Accumulation of more than 5% moisture was considered a failure of the container, and over time our quality assurance department launched a testing program: they allocated a given number of containers from each lot and put them into a storage room with a controlled-humidity environment.

We used a technique from reliability testing called “accelerated aging”: we increased the air humidity, pressure, and temperature in the storage room and used mass-transfer equations to compute the standard-humidity-equivalent lifetime of the samples.

Partial data are in the Moisture.csv file. Over the course of the testing, we accumulated several thousand samples. Eventually, we were able to identify the issues and fix them.

The R code to read and plot the data:

Data = read.csv("CSV/Moisture.csv")
DaysToFail = Data$LifeTime
hist(DaysToFail, cex = 0.75, main = "Histogram of DaysToFail")
print(summary(DaysToFail))
x11()   # open a new plotting window
boxplot(DaysToFail, main = "Boxplot of DaysToFail")
points(mean(DaysToFail), col = "red", pch = 2)   # mark the mean
abline(h = mean(DaysToFail) - 3 * sd(DaysToFail), col = "red")
abline(h = mean(DaysToFail) + 3 * sd(DaysToFail), col = "red")
legend(x = "bottomright", col = c("black", "red"),
       lty = c(1, 1), legend = c("quartiles", "mean & sd"))

Figure 5 a: Histogram of the durability of the packages before the fix was adopted (failure was recorded when the moisture content in the package exceeded the level permitted by the specifications)

The histogram shows that most of the data are in the 0-100 day range, but the upper tail is still long.

Figure 5 b: Boxplot of the durability of the packages before the fix was adopted (failure was recorded when the moisture content in the package exceeded the level permitted by the specifications)

The boxplot shows that the three-sigma (three-standard-deviation) range would have extended into negative values on the low-tail side, and that most of the outliers would have been ignored had we gone the way of the mean and standard deviation. Also notice how different the whiskers are on the upper end of the distribution from those on its lower end.

The goal of the project was to shift the regular data range so that samples surviving longer than a year were no longer outliers; accordingly, we measured success by the number of outliers that survived longer than 365 days.

Success was demonstrated when we collected another 1000 samples after the fix was implemented. Data are recorded in the MoistureAfter.csv file, which is available upon request.

Applying the same code to the file, we get the histogram and the boxplot:


Figure 5 c: Histogram of the durability of the packages after the fix was implemented 


Figure 5 d: Boxplot of the durability of the packages after the fix was implemented

Back in those days, R had not yet been invented (nor even S or S-PLUS), and all the programming for this project was done in C – a cumbersome task, especially when it came to plotting the data.
Now, with R, it is much easier. Let us see how we can improve that code to make it more flexible.

First, we can gather the data from both files into one data frame, like we did in the ANOVA example:
Data1 = read.csv ("CSV/Moisture.csv")
Data2 = read.csv ("CSV/MoistureAfter.csv")

Because of the exponential nature of the distribution of the DaysToFail variable, if we simply pooled the data, the lower part (corresponding to treatment = "None") would dissolve into the higher part, and we would not see it in a single boxplot. Again, R comes to the rescue:

Data1$treatment = "None"
Data2$treatment = "New"
Data = rbind(Data1, Data2)
x11()
boxplot (LifeTime ~ treatment, data = Data, main = "Boxplot of DaysToFail")

These lines of code produce a much more informative picture – one that can even be taken to the executives:


Figure 5 e: Boxplot of the durability of the packages after (“New”) and before (“None”) the fix was implemented

Note: R orders the boxplots alphabetically by the grouping factor's levels; “New” sorts before “None”, so “None” is drawn to the right of “New”.
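One standard way to control the order is to turn treatment into a factor with explicit levels before plotting:

Data$treatment = factor(Data$treatment, levels = c("None", "New"))
boxplot(LifeTime ~ treatment, data = Data, main = "Boxplot of DaysToFail")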

Of course, a picture is worth a thousand words, but a number is worth even more. Because these data are not normally distributed, we should not use the t-test here. That is why we decided to measure success by the number of outliers in the “New” and “None” data sets.

This is how it would be done in R:

Trtmts = levels(factor(Data$treatment))
TrtmtResponse = lapply(1:length(Trtmts), function(ti, moisture, trtmts) {
  source("foo.r")   # load GetOutlierIndices()
  DF = moisture[which(moisture$treatment == trtmts[ti]), ]
  Outliers = GetOutlierIndices(X = as.numeric(DF$LifeTime),
                               Upper = TRUE, Lower = FALSE)
  ### We are not interested in lower outliers
  return(DF$LifeTime[Outliers])
}, moisture = Data, trtmts = Trtmts
)

This code produces a list, TrtmtResponse (we used the lapply function), indexed by the position in the Trtmts vector. To make it more user-friendly, we shall give the elements of TrtmtResponse names corresponding to their indices; typing TrtmtResponse$New or TrtmtResponse$None at the console then prints the long-living samples of each group.

names(TrtmtResponse) = Trtmts
TrtmtResponse$New
TrtmtResponse$None
boxplot(TrtmtResponse); mtext("Long-Living Samples")

Figure 5 f: Comparison of outliers after (“New”) and before (“None”) the fix was implemented

As we can see, the outliers are much higher in the “New” case than in the “None” case. We are interested in outliers that live longer than 365 days, and we have all the data for that:

ResilientSamples.New = TrtmtResponse$New[which(TrtmtResponse$New > 365)]
ResilientSamples.Old = TrtmtResponse$None[which(TrtmtResponse$None > 365)]
if (length(ResilientSamples.New) > length(ResilientSamples.Old)) {
  print("Improved")
} else {
  print("No Improvement")
}

Again, we are not using the t-test on the resilient samples, because we cannot and should not assume that they come from a normal distribution: we are cutting the values off at a given threshold (the top of the upper whisker = 75th percentile + 1.5 * IQR), and that truncation makes them non-normal by definition.
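If a formal significance test is desired here, one distribution-free option (not the route we took) is a two-sample test of proportions on the counts of resilient samples in each group:

n.New  = sum(Data$treatment == "New")    # group sizes
n.None = sum(Data$treatment == "None")
prop.test(x = c(length(ResilientSamples.New), length(ResilientSamples.Old)),
          n = c(n.New, n.None))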

Now it is merely a matter of programming to put it all into an R function and to use it whenever similar analysis is necessary:

IsNewBetter = function(Data, howLong2Live = 365) {
  Trtmts = levels(factor(Data$treatment))
  TrtmtResponse = lapply(1:length(Trtmts),
    function(ti, moisture, trtmts) {
      source("foo.r")   # load GetOutlierIndices()
      DF = moisture[which(moisture$treatment == trtmts[ti]), ]
      Outliers = GetOutlierIndices(X = as.numeric(DF$LifeTime),
                                   Upper = TRUE, Lower = FALSE)
      ### We are not interested in lower outliers
      return(DF$LifeTime[Outliers])
    }, moisture = Data, trtmts = Trtmts
  )
  names(TrtmtResponse) = Trtmts
  boxplot(TrtmtResponse); mtext("Long-Living Samples")
  ResilientSamples.New =
    TrtmtResponse$New[which(TrtmtResponse$New > howLong2Live)]
  ResilientSamples.Old =
    TrtmtResponse$None[which(TrtmtResponse$None > howLong2Live)]
  HowMany.Old = length(ResilientSamples.Old)
  HowMany.New = length(ResilientSamples.New)
  Improvement = (HowMany.New - HowMany.Old) / HowMany.Old
  return(Improvement)
}
In this example, we measure improvement (the last two lines in the function definition) as the relative increase in the number of resilient samples (high outliers with magnitude greater than a given number), because that is what we had agreed to use as the measure of success.

Another way would be to include the magnitude of the outliers in the measure of improvement. The best way to measure success varies by use case, and it should be discussed with the subject-matter experts before any code is written.
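For instance, one possible variant – just a sketch, not the measure we agreed upon – weights each resilient sample by how far it exceeds the threshold:

MagnitudeImprovement = function(New, Old, howLong2Live = 365) {
  mNew = sum(pmax(New - howLong2Live, 0))   # total days lived beyond the threshold
  mOld = sum(pmax(Old - howLong2Live, 0))
  (mNew - mOld) / mOld                      # relative improvement in magnitude
}
MagnitudeImprovement(TrtmtResponse$New, TrtmtResponse$None)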
 
Conclusions
Outliers are not a nuisance, and they do not always have to be as low in magnitude as possible. In this example, we demonstrated how outliers can be used to measure important process metrics; in a later section, we shall see more uses of outliers in process diagnostics. We have also seen how to use the IQR methodology to detect outliers.

Whenever possible (i.e., whenever you have access to R), do not use the mean-and-standard-deviation approach for outlier detection; use the non-parametric IQR method instead: it is robust and far less prone to missing outliers (false negatives) in skewed data.



(To be continued...)
