The Inter-Quartile Range
In this section, we shall tackle the most important question with respect to outliers: what happens if X is NOT a normally distributed random variable?
While an enormous number of real-world distributions can be treated as normal, some violate the assumptions of the formalized mathematical tests so badly that they can very easily mislead the unsuspecting analyst into the wrong conclusions. The most prominent examples are response times and interarrival times.
Already in the early 1900s, the Danish telecommunications engineer A.K. Erlang determined that calls arrive in a Poisson process and tend to be processed in exponentially distributed times. That allowed him, and other engineers and researchers after him, to build a set of mathematical models of traffic systems with queuing.
The approach is based on abandoning the utilization approach (a normal-distribution-based approach) and taking advantage of the actual distributions: the systems are sized based on the probability of all resources being occupied when the next request arrives. This probability follows from Erlang's study of arrival rates and response times.
(Today these and derived models are used for sizing and predicting overload probabilities in telephony, transportation, fast-food chains, and, lately, more and more in IT; I've seen some research on applying the same principles to supply-demand chain modeling, and there is no reason not to: it is all derived from queuing theory.)
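To make the sizing idea concrete, here is a minimal R sketch (my illustration, not part of the original text) of the Erlang B formula, which computes exactly that quantity: the probability that all m servers are busy at the moment a request arrives, given an offered load of E erlangs (arrival rate times mean service time). The standard recurrence is used for numerical stability:
erlangB = function(E, m) {
    # B starts as the blocking probability for 0 servers and is
    # updated once per server via the Erlang B recurrence
    B = 1
    for (k in 1:m) B = (E * B) / (k + E * B)
    B
}
erlangB(E = 8, m = 10)   # about 0.12: 10 servers, 8 erlangs offered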
Naturally, for such systems, it becomes important to find outliers with respect to the actual distribution, not to some hypothetical normal one.
In the 1950s, George Box started developing the stochastic theory of process control, which developed into SPC (statistical process control) and all the way to its more recent incarnations, including Six Sigma and SEDS (more on SEDS and other work by Dr. Igor Trubin in this direction is available in the CMG Proceedings; just search his name on the CMG web site or on LinkedIn).
At the same time, Box was working on laying the foundations of the theory of automatic control of large systems (chemical plants, nuclear
reactors, food manufacturing facilities).
As part of this work, Box had to be able to identify outliers: data points where something had changed, where things were no longer “as before”. The answer came from John Tukey, whose quantile-analysis approach served him, and since then us, well.
The idea was simple: instead of the mean and standard deviation, use the median and percentiles to quantify the distribution. With the parametric approach, we have to deal with the mean, the standard deviation, the skewness (a measure of asymmetry), and the kurtosis (a measure of tail heaviness). In a similar manner, we should be able to quantify any distribution based on the median (the 50th percentile), the first and third quartiles (the 25th and 75th percentiles), and a multiplier for the inter-quartile range (the distance between the third and the first quartiles) to determine the tails of the distribution.
That approach was very promising: it made the heavy calculations of the third and fourth central moments of the distribution obsolete, and it opened an easy way for any distribution, not necessarily normal, to be quantified and therefore described and controlled mathematically.
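Here is a minimal R sketch (my illustration, using the same example data as below) of that quantile-based description: the three quartiles, the inter-quartile range, and the 1.5-multiplier “fences” that mark the tails:
X = rnorm(1000, 1, 0.25)
qs = quantile(X, probs = c(0.25, 0.50, 0.75))    # Q1, median, Q3
iqr = qs[[3]] - qs[[1]]                          # the inter-quartile range
fences = c(lower = qs[[1]] - 1.5 * iqr,          # everything outside the
           upper = qs[[3]] + 1.5 * iqr)          # fences is a candidate outlier
print(fences)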
These ideas were so appealing and important to the creators of the R precursors S and S-PLUS, and later of R itself, that they implemented these techniques in their base packages. Whenever you call the summary() function on a numeric vector, it returns the min, the max, the mean, and the three quartiles: Q1 (the 25th percentile), Q2 (the median), and Q3 (the 75th percentile):
> X = rnorm(1000, 1, 0.25)
> print(summary(X))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1285  0.8413  0.9984  0.9993  1.1780  1.7360
The summary() function returns a named vector (not a plain list), which makes it easy enough to access the individual parameters by position or by name.
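For example (my illustration; the element names are those printed by summary()):
smry = summary(X)
smry[[2]]           # the first quartile, accessed by position
smry[["Median"]]    # the median, accessed by name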
R also allows us to compute the inter-quartile range,
IQR:
> IQR(X)
[1] 0.336351
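As a quick sanity check (my illustration), IQR(X) is simply the distance between the third and the first quartiles:
all.equal(IQR(X), unname(diff(quantile(X, c(0.25, 0.75)))))   # TRUE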
The basic R graphics also allow us to visualize all that information, and more: the function boxplot(X) draws a box plot, also known as a box-and-whisker plot:
The thick line in the middle corresponds to the median, and the box itself is sandwiched between Q1 and Q3 of the data. The whiskers (the dotted lines ending in horizontal tick marks) extend to the most extreme data points within 1.5 * IQR(X) of the box, and the circular dots beyond the whiskers are the outliers.
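R exposes exactly what it draws through the boxplot.stats() function (my illustration; note that it uses hinges, which are very close to, though not always identical with, the quartiles):
bs = boxplot.stats(X)
bs$stats   # lower whisker, lower hinge (~Q1), median, upper hinge (~Q3), upper whisker
bs$out     # the points plotted beyond the whiskers, i.e., the outliers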
Parenthetically, we can also put the mean and the three-sigma
intervals on this plot, if we want to:
> abline(h = mean(X) + 3 * sd(X), col = "red")
> points(mean(X), col = "red", pch = 2)
> abline(h = mean(X) - 3 * sd(X), col = "red")
> legend(x = "bottomright", col = c("black", "red"),
+        lty = c(1, 1), legend = c("quartiles", "mean & sd"))
Interestingly, in our case, the upper 3-sigma interval would
have ended up ignoring the upper outliers:
Now that we have it all described, outlier detection becomes
a very simple exercise.
The basic outlier detection function (we can add it to the
already familiar foo.r file) will look something like this:
GetOutliers = function(X) {
    smry  = summary(X)
    iqr   = IQR(X)
    upper = smry[[5]] + 1.5 * iqr   # Q3 + 1.5 * IQR
    lower = smry[[2]] - 1.5 * iqr   # Q1 - 1.5 * IQR
    UpperOutliers = X[which(X > upper)]
    LowerOutliers = X[which(X < lower)]
    Outliers = c(LowerOutliers, UpperOutliers)
    Outliers
}
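A quick way to try it (my illustration, with two extreme values planted into otherwise normal data):
X = c(rnorm(1000, 1, 0.25), 5, -3)   # hypothetical data with two planted outliers
GetOutliers(X)                       # should include -3 and 5, possibly with a few random ones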
However, this function returns both high and low outliers, which may or may not be what we are after. Another weakness, and a matter of bad programming style, is that we are getting only the values of the outliers, but not their positions in the vector X. If X is only one column in a data frame, and we are looking to do something with all the rows where X is an outlier, it is better to make the function return the outlier indices. To make things more flexible, let us keep this function and create another one, which will return the indices of the outliers and let the user decide which outliers they are looking for.
GetOutlierIndices = function(X, Upper = TRUE, Lower = TRUE) {
    # Upper = TRUE; Lower = TRUE   ### for debugging
    smry  = summary(X)
    iqr   = IQR(X)
    upper = smry[[5]] + 1.5 * iqr   # Q3 + 1.5 * IQR
    lower = smry[[2]] - 1.5 * iqr   # Q1 - 1.5 * IQR
    UpperOLindices = which(Upper & X > upper)
    LowerOLindices = which(Lower & X < lower)
    Outliers = c(LowerOLindices, UpperOLindices)
    Outliers
}
We set the default behavior to return the indices of all outliers. The UpperOLindices and LowerOLindices lines are simply a compact way of saying,
if (Upper) {
    UpperOLindices = which(X > upper)
} else {
    UpperOLindices = c()
}
The same goes for LowerOLindices.
You are welcome to modify and enhance these and any other functions; these are just for illustration of the ideas.
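For instance, with a data frame (a hypothetical example of my own), the returned indices let us select whole rows:
DF  = data.frame(LifeTime = rexp(1000, rate = 1/50))   # hypothetical exponential data
idx = GetOutlierIndices(DF$LifeTime, Upper = TRUE, Lower = FALSE)
head(DF[idx, , drop = FALSE])                          # the rows where LifeTime is a high outlier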
Anecdote
When John Tukey was asked why the whiskers should be at 1.5 * IQR, he said, “1.0 is too small, and 2.0 is too big; it should really be case by case, but let's keep it at 1.5 until proven otherwise.”
IQR and Distribution Tails
Let us look at what happens with a non-normal distribution. Let us pick the exponential: as we discussed earlier in this section, time intervals, such as response times and, in reliability theory, times before failure, are usually distributed exponentially (with the exception of big averaging intervals, when the Central Limit Theorem forces them to hide their true nature). The IQR methodology does not rely on the distribution law of the data, and we shall demonstrate how to use it in the next Practical Application:
At some point in my life, I worked in the chemical industry, where we had to ensure that a certain chemical stayed dry for a given period of time (a year). Accumulation of more than 5% moisture was considered a failure of the container, and over time, our quality assurance launched a testing program: they allocated a given number of containers from each lot and put them into a storage room with a controlled-humidity environment.
We used a technique common in reliability testing called “accelerated aging”: we increased the air humidity, pressure, and temperature in the storage room and used mass-transfer equations to evaluate the standard-humidity equivalent lifetime of the samples.
Partial data are in the Moisture.csv file. Over the course of the testing, we
accumulated several thousand samples.
Eventually, we were able to identify the issues and fix them.
Here is the R code to read and plot the data:
Data = read.csv("CSV/Moisture.csv")
DaysToFail = Data$LifeTime
hist(DaysToFail, cex = 0.75, main = "Histogram of DaysToFail")
print(summary(DaysToFail))
x11()
boxplot(DaysToFail, main = "Boxplot of DaysToFail")
points(mean(DaysToFail), col = "red", pch = 2)
abline(h = mean(DaysToFail) - 3 * sd(DaysToFail), col = "red")
abline(h = mean(DaysToFail) + 3 * sd(DaysToFail), col = "red")
legend(x = "bottomright", col = c("black", "red"), lty = c(1, 1),
       legend = c("quartiles", "mean & sd"))
Figure 5a: Histogram of the durability of the packages before the fix was adopted (a failure was recorded when the moisture content in the packet was higher than permitted by the specifications)
The histogram shows that most of the data are in the 0-100
day range, but the upper tail is still long.
Figure 5b: Boxplot of the durability of the packages before the fix was adopted (a failure was recorded when the moisture content in the packet was higher than permitted by the specifications)
The boxplot shows that the three-sigma (three-standard-deviation) range would have taken us into the negative range on the low-tail side, and that most of the outliers would have been ignored had we gone the way of the mean and standard deviation. Also notice how different the whiskers are on the upper end of the distribution from those on its lower end.
The goal of the project was to move the regular data range to
ensure that samples that survived longer than a year were not outliers; so we
measured success by the number of outliers that survived longer than 365
days.
Success was demonstrated when we collected another 1000
samples after the fix was implemented.
Data are recorded in the MoistureAfter.csv file, which is available upon request.
Applying the same code to the file, we get
the histogram and the boxplot:
Figure 5c: Histogram of the durability of the packages after the fix was implemented
Figure 5d: Boxplot of the durability of the packages after the fix was implemented
Back in those days, R had not yet been invented (nor even S or S-PLUS), and all the programming for this project was done in C, which was a cumbersome task, especially when it came to plotting the data. Now, with R, it is much easier. Let us see how we can improve that code to make it more flexible.
First, we can gather the data from both files into one data
frame, like we did in the ANOVA example:
Data1 = read.csv("CSV/Moisture.csv")
Data2 = read.csv("CSV/MoistureAfter.csv")
Because of the exponential nature of the distribution of the DaysToFail variable, the lower part (corresponding to treatment = “None”) would dissolve into the higher part, and we would not see it in one boxplot. Again, it is R to the rescue:
Data1$treatment = "None"
Data2$treatment = "New"
Data = rbind(Data1, Data2)
x11()
boxplot(LifeTime ~ treatment, data = Data, main = "Boxplot of DaysToFail")
These lines of code produce a much more informative picture, one that can even be taken to the executives:
Figure 5e: Boxplot of the durability of the packages after (“New”) and before (“None”) the fix was implemented
Note: R orders the boxplots by the value on the X axis; the word “New” comes alphabetically before “None”, so it puts “None” to the right of “New”. One way to control the order is sketched below; if you know of other ways, let me know!
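For what it is worth, here is a sketch of my own (not part of the original code): set the factor levels explicitly before plotting, and boxplot() will follow that order:
Data$treatment = factor(Data$treatment, levels = c("None", "New"))   # "None" first
boxplot(LifeTime ~ treatment, data = Data, main = "Boxplot of DaysToFail")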
Of course, a picture is worth a thousand words, but a number is worth even more than that. Because these data are not normally distributed, we should not be using the T-test here. That is why we decided to measure success by the number of outliers in the “New” and “None” data sets. This is how it can be done in R:
Trtmts = levels(factor(Data$treatment))
TrtmtResponse = lapply(1:length(Trtmts), function(ti, moisture, trtmts) {
    source("foo.r")
    DF = moisture[which(moisture$treatment == trtmts[ti]), ]
    Outliers = GetOutlierIndices(X = as.numeric(DF$LifeTime),
                                 Upper = TRUE, Lower = FALSE)
    ### We are not interested in lower outliers
    return(DF$LifeTime[Outliers])
}, moisture = Data, trtmts = Trtmts)
This code produces a list, TrtmtResponse (we used the lapply function), indexed by the index of the Trtmts vector. To make it more user-friendly, we shall give the elements of TrtmtResponse names corresponding to their indices:
names(TrtmtResponse) = Trtmts
TrtmtResponse$New
TrtmtResponse$None
boxplot(TrtmtResponse); mtext("Long-Living Samples")
Figure 5f: Comparison of the outliers after (“New”) and before (“None”) the fix was implemented
As we can see, the outliers are much higher in the “New” case
than in the “None” case. We are
interested in outliers that live longer than 365 days, and we have all the data
for that:
ResilientSamples.New = TrtmtResponse$New[which(TrtmtResponse$New > 365)]
ResilientSamples.Old = TrtmtResponse$None[which(TrtmtResponse$None > 365)]
if (length(ResilientSamples.New) > length(ResilientSamples.Old)) {
    print("Improved")
} else {
    print("No Improvement")
}
Again, we are not using the T-test on the sizes of the resilient samples, because we cannot and should not assume that they come from a normal distribution: we are cutting the values off at a given threshold (the top of the upper whisker, i.e., the 75th percentile + 1.5 * IQR), and that makes them non-normal by definition.
Now it is merely a matter of programming to put it all into an R function and to use it whenever a similar analysis is necessary:
IsNewBetter = function(Data, howLong2Live = 365) {
    Trtmts = levels(factor(Data$treatment))
    TrtmtResponse = lapply(1:length(Trtmts), function(ti, moisture, trtmts) {
        source("foo.r")
        DF = moisture[which(moisture$treatment == trtmts[ti]), ]
        Outliers = GetOutlierIndices(X = as.numeric(DF$LifeTime),
                                     Upper = TRUE, Lower = FALSE)
        ### We are not interested in lower outliers
        return(DF$LifeTime[Outliers])
    }, moisture = Data, trtmts = Trtmts)
    names(TrtmtResponse) = Trtmts
    boxplot(TrtmtResponse); mtext("Long-Living Samples")
    ResilientSamples.New = TrtmtResponse$New[which(TrtmtResponse$New > howLong2Live)]
    ResilientSamples.Old = TrtmtResponse$None[which(TrtmtResponse$None > howLong2Live)]
    HowMany.Old = length(ResilientSamples.Old)
    HowMany.New = length(ResilientSamples.New)
    Improvement = (HowMany.New - HowMany.Old) / HowMany.Old   # relative increase
    return(Improvement)
}
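A hypothetical call (my illustration), assuming Data was assembled from the two CSV files as above:
Improvement = IsNewBetter(Data, howLong2Live = 365)
print(Improvement)   # e.g., 0.25 would mean 25% more resilient samples than before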
In this example, we measure improvement (the last two lines in the function definition) as the relative increase in the number of resilient samples (high outliers with a magnitude greater than a given number), because that is what we had agreed to use as the measure of success.
Another way to do it would be to include the magnitude of the outliers in the measure of improvement. The best way to measure success varies by use case, and it should be discussed with the subject-matter experts before any code is produced.
Conclusions
Outliers are not a nuisance, and they do not always have to be as few or as low in magnitude as possible. In this example, we demonstrated how outliers can be used to measure important process metrics. In a later section, we shall see more uses of outliers in process diagnostics. We have also seen how to use the IQR methodology to detect outliers.
Whenever possible (i.e., whenever you have access to R), do not use the mean-and-standard-deviation approach for outlier detection; use the non-parametric IQR method instead: it is robust and far less likely to produce false negatives.
(To be continued...)