Host CPU utilization
We already discussed host CPU utilization in an earlier post (R and Outliers, Part 2), where we used ANOVA to determine whether all the hosts are utilized uniformly. However, one of the limitations of the F-test is its requirement that the data follow the normal distribution. Since we cannot necessarily assume that this is so, we have to use a non-parametric method for the task. Base R does not provide a complete solution here: kruskal.test() is a non-parametric analogue of ANOVA, but a non-parametric version of Tukey's post-hoc analysis requires third-party packages. The IQR methodology, on the other hand, is much more robust and simpler than the F-test, and it gives immediate answers (i.e., no post-hoc test is required).
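As a quick illustration of why no post-hoc step is needed, the IQR fence can be applied directly to any sample. The following is a minimal sketch on synthetic data (not the pool discussed below):

```r
# Minimal sketch on synthetic data (not the real server pool):
set.seed(42)
x  <- c(rnorm(100, mean = 0.30, sd = 0.05), 0.95)  # one planted outlier
hi <- quantile(x)[[4]] + 1.5 * IQR(x)              # upper Tukey fence: Q3 + 1.5*IQR
which(x > hi)                                      # the planted point (index 101) is flagged
```

A single pass over the data yields the fence and the offending indices; no pairwise follow-up comparisons are required.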
Practical Application 4: Uniformity analysis
The problem formulation here is slightly different than in Practical Application 2 (R and Outliers, Part 2): we do not have thousands of data points for each pool. We have 75 hosts within one pool, running the same set of applications at the same time of the day. For each host, the CPU utilization data are collected at 5-minute intervals, so we have 75 sets of 12 data points. We need to find out whether any of the hosts are "misbehaving".
The data file (CPU_CHECK_FOR_OUTLIERS.csv) is available upon request. The MyPool data frame will then have two columns: the CPU utilization and the Host ID (a number).
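Since the CSV itself is only available on request, a tiny synthetic stand-in can illustrate the expected layout (the column names CPU and Host match the code that follows):

```r
# Synthetic stand-in for CPU_CHECK_FOR_OUTLIERS.csv (the real file is
# available on request); only the two-column layout matters here.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Host = rep(1:2, each = 3),
                     CPU  = c(0.21, 0.25, 0.23, 0.55, 0.52, 0.58)),
          tmp, row.names = FALSE)
MyPool <- read.csv(tmp)
str(MyPool)  # 'data.frame': 6 obs. of 2 variables: Host (int), CPU (num)
```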
EDA:
> boxplot(CPU ~ Host, data = MyPool)
> x11()
> boxplot(MyPool$CPU)
Figure 6a: CPU utilization within the pool of servers. The large hourly variance for each server makes outlier detection difficult.
From the boxplot, we see that there is a group of hosts behaving differently from the rest. It is also obvious that a direct application of the IQR outlier test will not give us much benefit:
> summary(MyPool$CPU)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.0003825 0.0908900 0.2194000 0.2851000 0.4250000 0.9909000
> IQR(MyPool$CPU)
[1] 0.3341146
> quantile(MyPool$CPU)
          0%          25%          50%          75%         100%
0.0003825315 0.0908883282 0.2193659367 0.4250029765 0.9909475488
The upper whisker:
> iqr = IQR(MyPool$CPU)
> qrtls = quantile(MyPool$CPU)
> hi = qrtls[[4]] + 1.5 * iqr
> hi
[1] 0.926175
We see that very few outliers would be captured with this direct approach, primarily because of the wide range of CPU utilization within each host: CPU utilization cannot be greater than 100%, and the whisker says that a value is not an outlier unless it is higher than about 93%. However, we can use a different approach: compare percentile with percentile.
> Medians = aggregate(CPU ~ Host, data = MyPool, FUN = "median")
> boxplot(CPU ~ Host, data = Medians, ylim = c(0, 1))
Figure 6b: Median CPU utilization for the same pool of servers.
That promises to be much more manageable:
> iqr = IQR(Medians$CPU)
> qrtls = quantile(Medians$CPU)
> hi = qrtls[[4]] + 1.5 * iqr
> hi
[1] 0.5097297
Any host whose median CPU utilization within that hour was higher than 51% is an outlier.
> hotHosts = which(Medians$CPU > hi)
> Medians[hotHosts,]
   Host       CPU
13   13 0.5459696
16   16 0.5439981
17   17 0.5209497
18   18 0.5420463
21   21 0.6609281
We were able to detect the five hosts that are definite outliers: 13, 16, 17, 18, and 21.
But this still leaves some hosts undetected: the boxplot shows more than just these five clustered together. To improve the detection rate, we can choose one of three possible ways:
1. Recall the anecdote about George Box and the 1.5: since we are using a central measure of the distribution for each host, we cannot demand as much certainty as we would if we were looking at the entire data set.
2. Use other percentiles as well (the most obvious would be Q1 and Q3): if in any of the three quartiles (Q1, median, and Q3) we find an outlier, chances are it is an outlier. A number of options are possible in this approach.
3. Finally, we can look for clusters centered around the five outliers we found.
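The second option can be sketched as follows. Here demoPool (a synthetic pool with one deliberately hot host) and the helper functions fence() and flag_at() are illustrative stand-ins, not part of the original analysis:

```r
# Sketch of option 2 on synthetic data: flag hosts that are over the
# fence at Q1, the median, and Q3 simultaneously. demoPool, fence(),
# and flag_at() are hypothetical stand-ins for illustration only.
set.seed(1)
demoPool <- data.frame(
  Host = rep(1:10, each = 12),
  CPU  = c(rnorm(9 * 12, mean = 0.25, sd = 0.05),   # nine ordinary hosts
           rnorm(12,     mean = 0.60, sd = 0.05)))  # host 10 runs hot
fence <- function(v) quantile(v)[[4]] + 1.5 * IQR(v)   # upper Tukey fence
flag_at <- function(p) {
  # per-host quantile p, then the hosts above the fence of those values
  q <- aggregate(CPU ~ Host, data = demoPool, FUN = quantile, probs = p)
  q$Host[q$CPU > fence(q$CPU)]
}
hot <- Reduce(intersect, lapply(c(0.25, 0.50, 0.75), flag_at))
hot  # host 10 is flagged at all three quartiles
```

Requiring a host to be an outlier at several quantiles at once trades sensitivity for fewer false alarms; relaxing the rule to "any quantile" moves the trade-off the other way.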
This is one of those cases where pure statistics does not give an immediate solution: we found statistically significant outliers, but which of them are practically significant? In such cases, it is advisable to consult the SMEs (subject-matter experts) on what they consider outliers and adjust the program accordingly.
For example, if the domain experts were able to identify 11 hosts (13-23) that had been allocated for testing a new application, then adjusting the IQR multiplier to 0.6 would capture the 11 hosts, but that would also include host #52, which the SMEs knew nothing about.
> hi = qrtls[[4]] + 0.6 * iqr
> hi
[1] 0.384761
> hotHosts = which(Medians$CPU > hi)
> Medians[hotHosts,]
   Host       CPU
13   13 0.5459696
14   14 0.4057436
15   15 0.3959872
16   16 0.5439981
17   17 0.5209497
18   18 0.5420463
19   19 0.4291690
20   20 0.4733736
21   21 0.6609281
23   23 0.4144875
52   52 0.3977826
Host 52 would then be taken out of the production workflow for repairs.
Conclusion
In this scenario, we have been able to use the same methodology that we used to find singular outliers for the detection of groups of outliers. Aggregating the data by medians enabled us to focus on outliers even when the variation within a unit (host) is comparable to the overall variability. Measures other than medians could have been used, and the method can also be extended to several quantiles (e.g., the 10th percentile, 25th, median, 75th, and 90th) by comparing each of the percentiles independently and identifying the units (hosts) where all five are outliers together. That would allow us to zero in more directly on the "true" outlier(s).
(To be continued...)