"There is no such thing as an outlier. They are all events"
A lot has been said and written about time series analysis (also known as
TSA). Methods such as ARIMA (AutoRegressive Integrated Moving Average) and EWMA
(Exponentially Weighted Moving Average) are well represented and described in R,
especially in the zoo, xts, and forecast packages. These tools allow the user to fit
models to trended and periodic data and even to forecast beyond the time range for
which the data exist.
And yet none of these packages detects events in data. There are two types of events: an event
proper (aka “a spike” or “a drop”) and a level shift. Some events are periodic, while others are
random (so-called shot noise; the reader is referred to other sources for
information on shot noise). But at the
basis of all event detection is the material we have already covered in this
chapter: all outliers in time-series data are events, and all events will, one
way or another, show up as outliers in time-series data.
Here we will analyze two practical applications where event
identification can be demonstrated on real-world examples.
Practical Application 5: Bad Transaction Identification
In web site operations – whether cloud-based, client-server, or
even run out of someone’s garage – different applications interact
via a request-response loop: a client application calls a server application,
receives data, and closes the connection.
The frequency of such requests and the average duration of the
interaction (response time), combined together, define the traffic – the number
of concurrent connections – which is the key performance metric determining the
size and the number of the servers to be provisioned.
It goes without saying that over-provisioning will drag the
web site operator into unnecessary capital expenditures and operating costs,
whereas undersizing will lead to undesirable consequences in site availability,
customer satisfaction, etc.
Consequently, we need to be able to identify when we are dealing with
outliers and when the strange-looking data points are part of a pattern.
The data for this Practical Application are saved in the
ResponseTimes.csv file (as always, available upon request). The code is very
simple:
resp = read.csv("ResponseTimes.csv", strip.white = TRUE)
summary(resp)
source("foo.r")
# We use the foo.r file again: we already have outlier detection there
RespTimeOutliers = GetOutlierIndices(X = resp$RespTime, Upper = TRUE, Lower = TRUE, PlotTF = TRUE)
This will produce the list of all outliers and a boxplot:
Figure 7: Application with two different transactions. (The data are in milliseconds.) Or are they outliers?
We do not want to size for outliers (see the definition of an
outlier at the beginning of this chapter), but if the outliers are all grouped
around a center or two, we may need to separate the application into two: one
with fast-turnaround transactions and the other with slow transactions. In this case, the data seem to fall into two
clusters, and we can use the standard kmeans()
function from the R distribution to identify them:
# kmeans() comes from the stats package, which is loaded by default
clusters = kmeans(x = resp$RespTime, centers = 2)
The kmeans() function produces a list of objects, one of which is the cluster assignment.
resp$cl = clusters$cluster
Parenthetically speaking, the k-means method, like most methods relying on a
normal distribution, is not the best possible way to detect clusters; but for
initial EDA (Exploratory Data Analysis), it will do. We have to be
very careful about when to use k-means clustering; but if we have many data
points behind each response time, it may be argued that the Central Limit
Theorem will make the data appear normally distributed for the data
points that we are looking at.
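The listing above does not show the plotting command itself; one way to produce a plot with the points color-coded by cluster is sketched below (a sketch only – it assumes the resp data frame and the cl column defined above):

```r
# color each response-time sample by its k-means cluster assignment
plot(resp$RespTime, col = resp$cl, pch = 19,
     xlab = "Sample number", ylab = "Response time, ms")
```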
The plot proves what we suspected:
Figure 8: By using k-means clustering, we were able to identify a hidden pattern in the data:
We really are dealing with
two processes in this application, and in this particular case, it is better to
separate the two processes into different applications, because the difference
between the two is too big.
What we initially saw as outliers was associated with the
longer-lived transactions returning the data.
The ResponseTimes.csv file contains five hours’ worth of response times
data averaged for every 5 seconds.
Ninety-eight out of the 3600 five-second intervals contained enough such
events that the average response time for these intervals was much higher than
for the rest of the data.
Clustering is a very efficient technique here that can be used to detect patterns in such data.
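The interval counts quoted above can be cross-checked directly from the k-means result; a quick sanity check (the exact counts, of course, depend on the data):

```r
# size of each cluster; the smaller one holds the slow intervals
table(clusters$cluster)
```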
Interestingly enough, if we had a
continuous line of outliers in Figure 7, clustering would possibly be
misleading, as these outliers would merely indicate that we have a response-time
distribution with a long tail on the right-hand side of the histogram. For an example, look at Practical
Application 6.
Practical Application 6: Behavior Change Detection
When analyzing time series, it is important to be able to
detect points of sharp change in behavior, such as level shifts and local
trends. While the xts and zoo packages allow the user, among other
things, to apply functions such as cor.test() in a rolling manner to detect
local trends, these rolling applications are typically slow.
A faster way to detect trend changes will be demonstrated here, but first
let us review level shifts.
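For reference, the slow rolling approach mentioned above can be sketched with rollapply() from the zoo package; here a rolling Spearman correlation of each window against its time index serves as a local trend indicator (a sketch on synthetic data; the window width of 60 is an assumption):

```r
library(zoo)

# synthetic series: flat noise followed by a rising trend
z = zoo(c(rnorm(150), rnorm(150) + seq(0, 5, length.out = 150)))

# rolling Spearman correlation against the window's time index;
# values near +1 or -1 flag a strong local trend
localTrend = rollapply(z, width = 60,
                       FUN = function(w) cor(seq_along(w), w, method = "spearman"))
```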
For an example, we will go to our old friend, application
response time. We had an actual
application that was chugging along just fine at around 100 milliseconds, until
a new product version was released. The moment it happened, its response time
jumped to 170 milliseconds. The data
were recorded every minute and are available from me upon request.
Figure 9: Application response time increased as a level shift
By eyeballing, it is obvious what happened: we had a
practically flat line, then a level shift, and then a nearly flat line
again. In mathematical terms, we need to
look at the first derivative (rate of change).
To make it easier to detect, we first take the hourly rolling mean of the
response times:
library(zoo)
hourly = rollmean(Resp, 60)
# rollmean() from the zoo package applies
# the rolling mean (running average) function to the data
x11(); par(mfcol = c(1, 2))
plot(hourly, pch = "", xlab = ""); lines(hourly)
mtext("Application X hourly average response time", cex = 0.7)
boxplot(hourly)
mtext("Application X hourly average response time", cex = 0.7)
This code will produce a more comprehensible plot:
Figure 9a: Hourly averages
Then we use the diff() function, since we know that this is
a time series with equal time intervals and the data are already sorted by the
timestamps. (We leave it to the reader
to make this code less sensitive to variations in time intervals.)
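One way to approach that exercise is to divide the value deltas by the timestamp deltas, turning the differences into rates of change per unit of time. A sketch under stated assumptions: the column names Time (a POSIXct vector) and RespTime are hypothetical, and the actual file layout may differ:

```r
# hypothetical columns: Time (POSIXct), RespTime (numeric)
dt   = as.numeric(diff(raw$Time), units = "mins")  # interval lengths in minutes
rate = diff(raw$RespTime) / dt                     # change per minute, robust to unequal gaps
```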
deltaR = diff(hourly)
plot(deltaR, pch = "", xlab = ""); lines(deltaR)
mtext("Hourly avg response time Deltas", cex = 0.85)
Figure 10: The diff() function applied to the hourly averages highlights the position of the level shift
We see that diff() produced data that are much noisier than the
original: we see outliers both in the proximity of the main
distribution and far away from it. The
IQR method is obviously too sensitive.
Fortunately, we remember what George Box said about the multiplier for the
IQR: it is dictated by what makes sense.
We are not interested in how the application behaved before the level shift or how it
behaved after it; we are only looking for the big change (gross outliers).
Changing the last line to boxplot(deltaR, range = 3.0) does
the trick:
Figure 10a: Level shift detection complete.
The rest is a matter of technique: we can now apply our
function GetOutlierIndices() that we
defined in the foo.r file. However, we need to modify it to allow
flexibility in selecting the IQR multiplier, which is easy enough to do in the function definition.
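The foo.r file itself is not reproduced in this post; a minimal sketch of what such a modified function might look like, assuming the IQR-based detection discussed earlier (the argument names mirror the calls shown above, but the Range parameter and its default are assumptions):

```r
# Sketch of an IQR-based outlier detector with a configurable multiplier.
# Range = 1.5 is the conventional boxplot default; Range is an assumed addition.
GetOutlierIndices = function(X, Upper = TRUE, Lower = TRUE,
                             PlotTF = FALSE, Range = 1.5) {
  qs  = quantile(X, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr = qs[2] - qs[1]
  # flag points lying beyond Range * IQR of the quartiles
  hi = X > qs[2] + Range * iqr
  lo = X < qs[1] - Range * iqr
  if (PlotTF) boxplot(X, range = Range)
  which((Upper & hi) | (Lower & lo))
}
```

With this version, the gross-outlier search above becomes a single call with Range = 3.0.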
Conclusions
We have demonstrated, on real-world examples from IT and from the
chemical industry, how outliers can be detected using basic R
functionality. We have seen that outliers
are not just a nuisance that the world is better off without; they can provide a
wealth of useful, actionable information to the analyst.
Along the way, we went over some basic but useful R language
specifics.
Needless to say, in detecting and making conclusions about
outliers, care must be taken to ensure that the statistical assumptions about
the data hold true and the correct statistical tests are used with the data.
One thing that is very important to remember is that, while R is a very
comprehensive tool, it is nevertheless just a tool, and it is up to the user to
apply it correctly. R is not an AI
language, and until it becomes one, it cannot possibly be smarter than the
analyst using it.