Monday, November 19, 2012

R and Outliers. Part 5: Time is a "Series" Matter

"There is no such thing as an outlier.  They are all events"

A lot has been said and written about time series analysis (also known as TSA). Methods such as ARIMA (AutoRegressive Integrated Moving Average) and EWMA (Exponentially Weighted Moving Average) are well represented and described in R, especially in the zoo, xts, and forecast packages. These tools allow the user to fit models to trended and periodic data and even to forecast beyond the time range for which data exist.
And yet none of these packages detects events in data. There are two types of events: an event proper (a.k.a. "a spike" or "a drop") and a level shift. Some events are periodic, while others are random (so-called shot noise; the reader is referred to other sources for information on shot noise). But at the basis of all event detection is the material we have already covered in this chapter: all outliers in time-series data are events, and all events will, one way or another, show up as outliers in time-series data.
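To make the two event types concrete, here is a minimal sketch (with made-up numbers, not data from this post) that generates a roughly flat series, injects a single spike, and then adds a level shift:

set.seed(42)
n = 300
x = rnorm(n, mean = 100, sd = 5)   # a roughly flat series around 100
x[150] = x[150] + 60               # an event proper: a single spike
x[200:n] = x[200:n] + 40           # a level shift starting at point 200
plot(x, type = "l", xlab = "time", ylab = "value")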
Here we will analyze two practical applications in which event identification is demonstrated on real-world data.
Practical Application 5: Bad Transaction Identification
In web site operations – whether cloud-based, client-server, or run out of someone’s garage – different applications interact via a request-response loop: a client application calls a server application, receives data, and closes the connection. The frequency of such requests and the average duration of each interaction (the response time), taken together, define the traffic – the number of concurrent connections – which is the key performance metric determining the size and the number of the servers to be provisioned.
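As a back-of-the-envelope illustration of that relationship (Little's law; the numbers below are hypothetical and not taken from this post's data):

req_per_sec   = 200                          # hypothetical request arrival rate
resp_time_sec = 0.5                          # hypothetical average response time, in seconds
concurrent    = req_per_sec * resp_time_sec  # ~100 concurrent connections to provision for
concurrent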

It goes without saying that over-provisioning will drag the web site operator into unnecessary capital expenditures and operating costs, whereas undersizing will lead to undesirable consequences in site availability, customer satisfaction, etc. Consequently, we need to be able to identify when we are dealing with outliers and when the strange-looking data points are part of a pattern.

The data for this Practical Application are saved in the ResponseTimes.csv file (as always, available upon request). The code is very simple:

resp = read.csv("ResponseTimes.csv", strip.white = TRUE)
summary(resp)
source("foo.r")
# We use the foo.r file again: we already have outlier detection there
RespTimeOutliers = GetOutlierIndices(
    X = resp$RespTime, Upper = TRUE, Lower = TRUE, PlotTF = TRUE)

This will produce the list of all outliers and a boxplot:
Figure 7: An application with two different transactions – or are they outliers? (The data are in milliseconds.)

We do not want to size for outliers (see the definition of an outlier at the beginning of this chapter), but if the outliers are all grouped around one or two centers, we may need to separate the application into two: one with fast-turnaround transactions and the other with slow transactions. In this case, the data seem to fall into two clusters, and we can use the standard kmeans() function (in the stats package, part of the base R distribution) to identify them:

# kmeans() is in the stats package, which is loaded by default
clusters = kmeans(x = resp$RespTime, centers = 2)

The kmeans() function returns a list-like object, one element of which, cluster, contains the cluster assignment of each data point.
resp$cl = clusters$cluster

Parenthetically speaking, k-means, like most methods that implicitly assume a normal distribution, is not the best possible way to detect clusters; but for initial EDA (Exploratory Data Analysis), it will do. We have to be careful about when to use k-means clustering; here, however, each response-time value is an average over many individual requests, so the Central Limit Theorem makes the data points we are looking at appear approximately normally distributed.
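The plotting code for the figure below is not reproduced in the post; a minimal sketch that colors each point by its k-means cluster (assuming the resp and clusters objects created above) could look like this:

# Color each 5-second average by its cluster assignment
plot(resp$RespTime, col = resp$cl, pch = 19, cex = 0.5,
     xlab = "5-second interval", ylab = "Response time, ms")
legend("topright", legend = c("cluster 1", "cluster 2"), col = 1:2, pch = 19)
table(resp$cl)   # how many 5-second intervals fall into each cluster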

The plot proves what we suspected:
Figure 8: By using k-means clustering, we were able to identify a hidden pattern in the data.
We really are dealing with two processes in this application, and in this particular case it is better to separate them into two different applications, because the difference between the two is too large.

What we initially saw as outliers was associated with the longer-lived transactions returning the data. The ResponseTimes.csv file contains five hours’ worth of response times averaged over every 5 seconds. Ninety-eight of the 3,600 five-second intervals contained enough of these long transactions that the average response time for those intervals was much higher than for the rest of the data.

Clustering is thus an efficient technique for detecting patterns in such data.

Interestingly enough, if we had a continuous line of outliers in Figure 7, clustering could be misleading, as those outliers would merely indicate that we have a response-time distribution with a long tail on the right-hand side of the histogram. For an example, see Practical Application 6.
Practical Application 6: Behavior Change Detection
When analyzing time series, it is important to be able to detect points of sharp change in behavior, such as level shifts and local trends. While the xts and zoo packages allow the user, among other things, to apply functions such as cor.test() in a rolling manner to detect local trends, these rolling computations are typically slow. A faster way to detect trend changes will be demonstrated here, but first let us review level shifts.
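For reference, here is a minimal sketch of the rolling-window approach mentioned above, using zoo::rollapply(); the window width and the resp_ts vector are assumptions, and a plain Kendall correlation against time is used as the local trend statistic instead of a full cor.test():

library(zoo)
# resp_ts: a numeric vector (or zoo series) of response times, assumed to exist
# Rolling Kendall tau between each 60-point window and time; values near +1 or -1
# indicate a strong local trend, values near 0 indicate none
local_trend = rollapply(resp_ts, width = 60, by = 1, align = "right",
                        FUN = function(w) cor(seq_along(w), w, method = "kendall"))
plot(local_trend, type = "l", ylab = "rolling Kendall tau")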
For an example, we will go to our old friend, application response time. We had an actual application that was chugging along just fine at around 100 milliseconds, until a new product version was released. The moment it happened, its response time jumped to 170 milliseconds. The data were recorded every minute and are available from me upon request.


Figure 9: Application response time increased as a level shift

By eyeballing it, it is obvious what happened: we had a practically flat line, then a level shift, and then a nearly flat line again. In mathematical terms, we need to look at the first derivative (the rate of change).

To make it easier to detect, we first take the hourly rolling mean of the response times:
library(zoo)
# Resp: the per-minute response times read from the data file mentioned above
hourly = rollmean(Resp, 60)
# rollmean() from the zoo package applies a rolling mean
# (running average) to the data
x11(); par(mfcol = c(1, 2))
plot(hourly, pch = "", xlab = ""); lines(hourly)
mtext("Application X hourly average response time", cex = 0.7)
boxplot(hourly)
mtext("Application X hourly average response time", cex = 0.7)

This code will produce a more comprehensible plot:

Figure 9a: hourly averages
Then we use the diff() function, since we know that this is a time series with equal time intervals and the data are already sorted by timestamp. (We leave it to the reader to make this code less sensitive to variations in time intervals.)

deltaR = diff(hourly)
plot(deltaR, pch = "", xlab = ""); lines(deltaR); 
mtext("Hourly avg response time Deltas", cex = 0.85)
boxplot(deltaR); mtext("Hourly avg response time Deltas", cex = 0.85)


Figure 10: The diff() function applied to the hourly averages highlights the position of the level shift.
We see that diff() produced data that are a lot noisier than the original: there are outliers both in the proximity of the main distribution and far away from it. The IQR method is obviously too sensitive here. Fortunately, we remember what George Box said about the multiplier on the IQR: it is dictated by what makes sense. We are not interested in how the application behaved before the level shift or after it; we are only looking for the big change (the gross outliers). Changing the last line to boxplot(deltaR, range = 3.0) does the trick:
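For completeness, the same whisker multiplier is accepted by boxplot.stats() through its coef argument, so a short sketch for extracting only the gross outliers (and where in the series they occur) could be:

# Gross outliers of the hour-to-hour deltas, using a 3.0 * IQR whisker
gross = boxplot.stats(deltaR, coef = 3.0)$out
gross                      # the size(s) of the level shift(s)
which(deltaR %in% gross)   # position(s) of the shift(s) in the series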




Figure 10a: Level shift detection complete.

The rest is a matter of technique: we can now apply the GetOutlierIndices() function that we defined in the foo.r file. However, we need to modify it to accept a configurable IQR multiplier, which is easy enough to do in the function definition.
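Since foo.r is not reproduced in this post, the following is only a sketch of what such a modified function might look like: an IQR-based outlier finder whose whisker multiplier is exposed as an extra Range argument (the signature mirrors the earlier call; names and defaults are assumptions):

# A hypothetical version of GetOutlierIndices() with a configurable IQR multiplier;
# not the original foo.r code
GetOutlierIndices = function(X, Upper = TRUE, Lower = TRUE,
                             PlotTF = FALSE, Range = 1.5) {
    q    = quantile(X, c(0.25, 0.75), na.rm = TRUE)
    iqr  = q[2] - q[1]
    high = q[2] + Range * iqr
    low  = q[1] - Range * iqr
    if (PlotTF) boxplot(X, range = Range)
    which((Upper & X > high) | (Lower & X < low))
}

# Applied to the deltas with the 3.0 multiplier used above:
ShiftIndices = GetOutlierIndices(X = deltaR, Range = 3.0, PlotTF = TRUE)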



Conclusions

We have demonstrated, on real-world examples from IT and from the chemical industry, how outliers can be detected using basic R functionality. We have seen that outliers are not just a nuisance that the world would be better off without; they can provide a wealth of useful, actionable information to the analyst.

Along the way, we went over some basic but useful R language specifics.
 
Needless to say, in detecting outliers and drawing conclusions about them, care must be taken to ensure that the statistical assumptions about the data hold true and that the correct statistical tests are used.

One thing that is very important to remember is that, while R is a very comprehensive tool, it is just a tool nevertheless, and it is up to the user to apply it correctly. R is not an AI language, and until it becomes one, it cannot possibly be smarter than the analyst using it.

One topic left uncovered is the detection of outliers in multivariate analysis, along with the importance of outliers in detecting periodicity in data. I will cover these topics in upcoming posts.
