Sunday, November 18, 2012

R and Outliers. Part 4: Uniformity Analysis

Host CPU utilization

We already discussed this in an earlier post (R and Outliers, Part 2), where we used ANOVA to determine whether all the hosts are utilized uniformly. However, one of the limitations of the F-test is its requirement that the data follow the normal distribution. Since we cannot necessarily assume that this is so, we have to use a non-parametric method for the task. Basic R does not offer that directly. There are a number of packages available that do, but they all require a non-parametric version of Tukey's post-hoc analysis.
 
The IQR methodology, on the other hand, is simpler and more robust than the F-test, and it gives immediate answers (i.e., no post-hoc test is required).
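As a reminder, the IQR (Tukey-fence) rule flags any point below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. A minimal sketch on made-up numbers:

```r
# Tukey-fence rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
x <- c(2, 3, 3, 4, 4, 5, 5, 6, 20)   # made-up sample with one obvious outlier
q   <- quantile(x)
iqr <- IQR(x)
lo  <- q[[2]] - 1.5 * iqr            # lower fence
hi  <- q[[4]] + 1.5 * iqr            # upper fence
x[x < lo | x > hi]                   # -> 20
```

The 1.5 multiplier is conventional; as discussed later in the post, it can be adjusted.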

Practical Application 4: Uniformity analysis
The problem formulation here is slightly different than in Practical Application 2 (R and Outliers, Part 2): we do not have thousands of data points for each pool. We have 75 hosts within one pool, running the same set of applications at the same time of day. For each host, the CPU utilization data are collected at 5-minute intervals, so we have 75 sets of 12 data points (one hour's worth). We need to find out whether any of the hosts are "misbehaving".


The MyPool data frame will then have two columns: the CPU utilization (CPU) and the host ID (Host, a number).
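The actual pool data are not included in the post, but a synthetic MyPool with the same shape (75 hosts, 12 five-minute samples each) is easy to generate for following along; the distribution parameters below are invented, not taken from the real pool:

```r
# Synthetic stand-in for the pool data (parameters invented):
# 75 hosts x 12 five-minute CPU samples, utilization in (0, 1).
set.seed(42)
Host <- rep(1:75, each = 12)
CPU  <- rbeta(75 * 12, shape1 = 2, shape2 = 5)
CPU[Host %in% 13:23] <- rbeta(11 * 12, shape1 = 5, shape2 = 4)  # a "hot" group
MyPool <- data.frame(Host = Host, CPU = CPU)
nrow(MyPool)   # -> 900
```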
EDA:

> boxplot(CPU ~ Host, data = MyPool)
>
> x11()
> boxplot(MyPool$CPU)


Figure 6 a: CPU utilization within the pool of servers.
The large hourly variance for each server makes outlier detection difficult

From the boxplot, we see that there is a group of hosts behaving differently from the rest. It is also obvious that a direct application of the outlier rule to the pooled data will not give us much benefit:

> summary(MyPool$CPU)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0003825 0.0908900 0.2194000 0.2851000 0.4250000 0.9909000
> IQR(MyPool$CPU)
[1] 0.3341146

> quantile(MyPool$CPU)
0% 25% 50% 75% 100%
0.0003825315 0.0908883282 0.2193659367 0.4250029765 0.9909475488
>

The whiskers:
> iqr = IQR(MyPool$CPU)
> qrtls = quantile(MyPool$CPU)
> hi = qrtls[[4]] + 1.5 * iqr
> hi

[1] 0.926175
We see that very few outliers would be captured if we followed the direct approach in this case, primarily because of the wide range of CPU utilization within each host: CPU utilization cannot be greater than 100%, and the upper whisker says that a value is not an outlier unless it is higher than 92%.

However, we can use a different approach: compare percentiles with percentiles. We aggregate each host down to its median and look for outliers among the 75 medians.

> Medians = aggregate (CPU ~ Host, data = MyPool, FUN = "median")
> boxplot (CPU ~ Host, data = Medians, ylim = c(0, 1))


Figure 6 b: Median CPU utilization for the same pool of servers
That promises to be much more manageable:
> iqr = IQR(Medians$CPU)
> qrtls = quantile(Medians$CPU)
> hi = qrtls[[4]] + 1.5 * iqr
> hi
[1] 0.5097297

Any host whose median CPU utilization within that hour was higher than 51% is an outlier.
> hotHosts = which (Medians$CPU > hi)
> Medians[hotHosts,]
Host CPU
13 13 0.5459696
16 16 0.5439981
17 17 0.5209497
18 18 0.5420463
21 21 0.6609281

We were able to detect the five hosts that are definite outliers: 13, 16, 17, 18, and 21.
But this still leaves some hosts undetected: the boxplot suggests that more than these 5 hosts belong together. To improve the rate of success, we can choose one of three possible ways:
1. Recall the anecdote about George Box and the 1.5: since we are using a central measure of the distribution for each host, we cannot demand as much certainty as we would if we were looking at the entire data set, so a smaller multiplier may be justified.
2. Use other percentiles as well (the most obvious would be Q1 and Q3): if any of the three quartiles (Q1, median, and Q3) turns out to be an outlier, chances are the host is an outlier. A number of variations are possible within this approach.
3. Finally, we can look for clusters centered around the 5 outliers we found.
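The second option can be sketched as follows. The data and helper below are illustrative (synthetic data, hypothetical names): compute Q1, the median, and Q3 for each host, then flag a host whose value exceeds the upper fence in any of the three columns:

```r
# Option 2 sketch: flag a host if ANY of its per-host quartiles (Q1, median, Q3)
# is an outlier relative to the same quartile across all hosts.
# MyPool here is a synthetic stand-in with the shape described in the post.
set.seed(1)
MyPool <- data.frame(Host = rep(1:75, each = 12),
                     CPU  = rbeta(900, shape1 = 2, shape2 = 5))

upper_fence <- function(x, k = 1.5) quantile(x)[[4]] + k * IQR(x)

# aggregate() returns CPU as a 75 x 3 matrix: one column per requested quartile
perHost <- aggregate(CPU ~ Host, data = MyPool, FUN = quantile,
                     probs = c(0.25, 0.50, 0.75))
outlierInAny <- apply(perHost$CPU, 2, function(col) col > upper_fence(col))
perHost$Host[rowSums(outlierInAny) > 0]   # hosts suspicious in at least one quartile
```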

This is one of those cases where pure statistics does not give an immediate solution: we found statistically significant outliers, but which of them are practically significant? In such cases, it is advisable to consult the SMEs (subject-matter experts) on what they consider outliers and adjust the program accordingly.

For example, if the domain experts were able to identify 11 hosts (13-23) that had been allocated for testing a new application, then adjusting the IQR multiplier to 0.6 would capture those 11 hosts; but it would also include host #52, which the SMEs knew nothing about.

> hi = qrtls[[4]] + 0.6 * iqr
> hi
[1] 0.384761
>
> hotHosts = which (Medians$CPU > hi)
> Medians[hotHosts,]
Host CPU
13 13 0.5459696
14 14 0.4057436
15 15 0.3959872
16 16 0.5439981
17 17 0.5209497
18 18 0.5420463
19 19 0.4291690
20 20 0.4733736
21 21 0.6609281
23 23 0.4144875
52 52 0.3977826
Host 52 would then be taken out of the production workflow for repairs.

Conclusion

In this scenario, we were able to use the same methodology we used for finding singular outliers to detect groups of outliers. Aggregating the data by medians enabled us to focus on outliers even when the variation within a unit (host) is comparable to the overall variability. Measures other than medians could have been used, and the method can also be extended to several quantiles (e.g., the 10th percentile, 25th, median, 75th, and 90th) by comparing each of the percentiles independently and identifying the units (hosts) where all 5 are outliers together. That would allow us to zero in more directly on the "true" outlier(s).
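As a sketch of that extension (again on synthetic data, with hypothetical names), we can compute the five percentiles per host, apply the fence to each column independently, and keep only the hosts that are outliers in all five:

```r
# Flag a host only if ALL five of its percentiles (10th, 25th, 50th, 75th, 90th)
# are outliers among the corresponding percentiles of the other hosts.
set.seed(7)
MyPool <- data.frame(Host = rep(1:75, each = 12),
                     CPU  = rbeta(900, shape1 = 2, shape2 = 5))
MyPool$CPU[MyPool$Host == 21] <- rbeta(12, shape1 = 50, shape2 = 3)  # one clearly hot host

upper_fence <- function(x, k = 1.5) quantile(x)[[4]] + k * IQR(x)

probs      <- c(0.10, 0.25, 0.50, 0.75, 0.90)
perHost    <- aggregate(CPU ~ Host, data = MyPool, FUN = quantile, probs = probs)
outlierMat <- apply(perHost$CPU, 2, function(col) col > upper_fence(col))
perHost$Host[rowSums(outlierMat) == length(probs)]   # with this synthetic data, host 21 should appear
```

Requiring all five percentiles to cross their fences makes the test conservative: a host with one noisy spike is ignored, while a host that is shifted across its whole distribution is caught.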

(To be continued...)
