Sunday, November 25, 2012

R and Outliers: Part 6. Trended Data. 1: Simplest Cases



Outliers in the Dependent Variable

Let’s consider a simple case of a time series with a linear trend and some data points that “don’t belong” (Figure 1).


Figure 1: A time series with outliers

If we build a boxplot for this time series, we will not see any outliers:
Figure 2: Boxplot of the time series in Figure 1


Why the boxplot misses them becomes obvious if we look at the two plots together:

 
Figure 3: The two plots together

The outliers are masked by the trend: the TimeSeries in this case is a dependent variable, and in order to identify the outliers, we need to de-trend the data.

We have two ways to do that:

Get the diff ():

This is a valid approach that will work for linear correlations like the one in Figure 1.  The idea is to bring the slope of the line to zero, so that all the outliers become visible.
The trick to using it for non-linear correlations is in the Taylor series decomposition (see, e.g., http://en.wikipedia.org/wiki/Taylor_series for definition):

f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x - a)^n

Equation 1

or its Newton polynomial version:

N(x) = \sum_{j=0}^{k} a_j \, n_j(x), \quad a_j = [y_0, \ldots, y_j]

Equation 2

with the Newton basis polynomials defined as

n_j(x) = \prod_{i=0}^{j-1} (x - x_i), \quad n_0(x) \equiv 1

Equation 3

This suggests a convenient way to flatten the line: keep taking the differences until the slope (expressed as the difference at the next step) becomes “as good as zero”.
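To see how this works, here is a minimal sketch on simulated data (not the data set used below): a quadratic trend still shows a slope after one differencing step, but the second difference is essentially flat, leaving only the noise (and any outliers) behind.

# Minimal sketch on simulated data: repeated differencing flattens a quadratic trend
set.seed (42)
x = 1:200
y = 0.01 * x^2 + rnorm (length (x))    # nonlinear (quadratic) trend plus noise
d1 = diff (y)                          # first difference: still trending
d2 = diff (y, differences = 2)         # second difference: essentially flat
x11(); par (mfcol = c (1, 3))
plot (y,  type = "l", main = "original")
plot (d1, type = "l", main = "diff, order 1")
plot (d2, type = "l", main = "diff, order 2")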

In the linear case, the R code would be (the data file is available upon request):

# Read the data and compute the first differences of both variables
Data = read.csv ("C:/Sasha/RnD/RDMBook/TimeSeries_Trend_Outliers.csv",
                 strip.white = TRUE)
Diff1 = data.frame (TS     = Data [2:nrow (Data), ]$TS,
                    DeltaY = diff (as.numeric (Data$TimeSeries)),
                    DeltaX = diff (as.numeric (Data$TS)))
# Estimate the local slope as DeltaY / DeltaX and look for outliers in it
Diff1$Slope1 = as.numeric (Diff1$DeltaY) / as.numeric (Diff1$DeltaX)
x11(); par (mfcol = c (1, 2))
plot (Slope1 ~ TS, data = Diff1)
boxplot (Diff1$Slope1)

and the plots produced are shown in Figure 4.
 
Figure 4: First derivative estimation calculated as diff (TimeSeries) / diff (TS)

It is obvious that the outliers we spotted by eyeballing the data in Figure 1 are among the outliers detected after the differencing, but the data have become much noisier.  This makes sense if we consider that a moving-average filter is the equivalent of integration; the opposite operation, differentiation, amplifies the noise.
Needless to say, each further step of differentiation adds more and more noise, until we can no longer see the forest for the trees.
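A quick way to see this noise amplification (a minimal sketch on simulated white noise, not the data above): the spread of the differences grows with every differencing step.

# Minimal sketch: repeated differencing of pure noise inflates its spread
set.seed (1)
noise = rnorm (1000)
sd (noise)                          # ~ 1
sd (diff (noise))                   # ~ 1.4
sd (diff (noise, differences = 2))  # ~ 2.4
sd (diff (noise, differences = 3))  # ~ 4.5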

Conclusion:

Differentiation works for detecting outliers in the dependent variable, but watch out for the additional noise!

Brute-force approach:

Build a regression, get the residuals, and find outliers in the residuals.  This approach is simple and effective.  Regression in R is very efficient, especially ordinary least-squares regression, and in our case we have a linear correlation between the dependent variable (TimeSeries) and the independent variable (TS).

R code:

# Fit a linear regression of TimeSeries on TS, then plot the fit and its residuals
LinRegr = lm (TimeSeries ~ TS, data = Data)
LinRegr
Model = predict (LinRegr)
x11(); par (mfcol = c (2, 1))
plot (TimeSeries ~ TS, data = Data, ylab = "TimeSeries")
lines (as.numeric (Model) ~ Data$TS, lwd = 4, pch = 5, col = "red")
plot (as.numeric (LinRegr$resid) ~ Data$TS, ylab = "LinRegr$resid", xlab = "TS")

This code produces the plot in Figure 5 (the red dotted arrows were added later by hand).

Figure 5: Brute-force approach

This has the potential to detect all the points “that don’t belong” as outliers, with no false positives.  Let’s verify that using the now-familiar IQR-based approach.

Recall that earlier we had a number of functions defined in the file foo.r, one of them being GetOutlierIndices (…).  The full function definition looks like this:
                                       
GetOutlierIndices = function (X, Upper = TRUE, Lower = TRUE,
                              RangeMplr = 1.5, PlotTF = FALSE) {

       if (PlotTF)   boxplot (X)

       # IQR-based fences: Q3 + RangeMplr * IQR and Q1 - RangeMplr * IQR
       smry  = summary (X)
       iqr   = IQR (X)
       upper = smry[[5]] + RangeMplr * iqr
       lower = smry[[2]] - RangeMplr * iqr

       # Indices of the points outside the requested fence(s)
       UpperOLindices = which (Upper & X > upper)
       LowerOLindices = which (Lower & X < lower)

       Outliers = c (LowerOLindices, UpperOLindices)
       Outliers
}

Now all we need to do is source the foo.r file in the code and use the function:

source ("foo.r")
x11(); par (mfcol = c (1,1)) # prepare room for the new plot
OLInds = GetOutlierIndices (X = LinRegr$resid, RangeMplr = 1.5,
                            Upper = TRUE, Lower = TRUE, PlotTF = TRUE)
> OLInds
589 643 724 845 988 997
The code snippet that follows only serves the purpose of showing the plot in Figure 6:
Residuals = data.frame (Deltas = LinRegr$resid, TS = Data$TS)
Residuals [-OLInds, ]$Deltas = 0   # zero out everything except the outliers
x11(); par (mfcol = c (1, 1)) # prepare room for the new plot
plot (as.numeric (LinRegr$resid) ~ Data$TS, ylab = "LinRegr$resid", xlab = "TS")
points (Deltas ~ TS, data = Residuals, col = "red", pch = 5, cex = 0.8)
legend (x = "topleft", pch = c (1, 5), col = c ("black", "red"),
        legend = c ("Residuals", "Outliers"))

 

Figure 6: Residuals plotted with outliers

We found all 6 outliers, and their indices correspond to the indices of the original “data points that don’t belong”.  By adjusting RangeMplr in the GetOutlierIndices () call, we can further fine-tune the detection (for example, exclude points 589 and 997) without making the data any noisier than they already are.
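For example, a wider fence (a larger RangeMplr) keeps only the more extreme residuals; the value of 3 below is only an illustration, and the exact multiplier that would drop points 589 and 997 depends on the data.

# Illustration only: a wider fence (3 * IQR) flags fewer, more extreme points
StrictOLInds = GetOutlierIndices (X = LinRegr$resid, RangeMplr = 3,
                                  Upper = TRUE, Lower = TRUE, PlotTF = FALSE)
StrictOLInds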

Summary

To summarize, detection of outliers in trended data can be described in three steps (Figure 7):
1. Fit a curve to the data.
2. Compute the residuals (errors).
3. Find the outliers in the residuals; these are the outliers of the original data.

 

Figure 7: Outlier detection algorithm
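The three steps are easy to string together.  Here is a minimal sketch for the linear case, reusing GetOutlierIndices () from foo.r (the wrapper FindTrendOutliers below is just a name for this sketch, not a function from foo.r):

# Hypothetical wrapper illustrating the three steps of Figure 7 (linear fit)
FindTrendOutliers = function (x, y, RangeMplr = 1.5) {
       Fit   = lm (y ~ x)                                  # 1. fit a curve to the data
       Resid = residuals (Fit)                             # 2. compute the residuals
       GetOutlierIndices (X = Resid, RangeMplr = RangeMplr,
                          Upper = TRUE, Lower = TRUE)       # 3. outliers in the residuals
}
# e.g.: FindTrendOutliers (x = as.numeric (Data$TS), y = as.numeric (Data$TimeSeries))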

 

Outliers with “Dangerous Curves”

The linear dependency case is merely a very simple demonstration of the approach outlined in Figure 7; there is no reason, however, not to use it for nonlinear regressions.  When we have a nonlinear relationship, we need to consider the following questions:
1. Can we use local regression with the data?
·         LOESS (LOcal regrESSion) and its cousin LOWESS (LOcally WEighted Scatterplot Smoothing) will very often be the tools of choice in this kind of analysis if we are only interested in finding outliers, regardless of what they mean (see the sketch after this list).
      i.      For functions in R that fit loess or lowess curves to your data, type “?lowess” and “?loess” in the R GUI for details.
·         Splines, their multivariate counterpart MARS (Multivariate Adaptive Regression Splines), and other piecewise regression methods.
      i.      There are some very nice blogs and lectures online describing the use of splines and GAMs (Generalized Additive Models) in R, like this one: http://statisticsr.blogspot.com/2008/10/notes-for-polynomial-regression-and.html
      ii.     Or this one: http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_16.pdf
·         One thing to look out for is overfitting: if the curve goes through every data point (in the extreme case), then we will not be able to find any outliers in the residuals.
2. Do we know the physics behind the data (by “physics” I mean the general knowledge of why the data behave the way they do)?
3. Is time a predictor variable?
·         If it is, then we can use Exponentially Weighted Moving Average (EWMA) and / or Autoregressive Integrated Moving Average (ARIMA) techniques to find the best-fitting curves.  For more details on the finer points of forecasting and regression in Time-Series Analysis (TSA), go to http://alexonsimanddata.blogspot.com/2011/12/time-series-forecasting-regression-and.html .  More in the Proceedings of the CMG06 and CMG12 conferences (www.cmg.org).
4. What are the possible “linearizable” curves that can describe this relationship?
There are 4 basic “linearizable” curves:
·         Exponential: y = a·e^(b·x), linearized as ln y = ln a + b·x
·         Power: y = a·x^b, linearized as ln y = ln a + b·ln x
·         Logarithmic: y = a + b·ln x, already linear in ln x
·         Reciprocal: y = 1 / (a + b·x), linearized as 1/y = a + b·x
The linearizations shown above can be derived by simple algebraic substitution.  How to select the best-fitting regression is a fascinating subject, but it is too big for this post.  Stay tuned!
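As promised above, here is a minimal sketch of the Figure 7 algorithm with a local (loess) fit in place of the straight line, assuming TS can be coerced to numeric as in the diff () example; the span value is only a starting point and needs tuning to avoid overfitting.

# Minimal sketch: loess fit plus IQR-based outlier detection in the residuals
LoFit    = loess (as.numeric (TimeSeries) ~ as.numeric (TS), data = Data, span = 0.5)
LoResid  = residuals (LoFit)
LoOLInds = GetOutlierIndices (X = LoResid, RangeMplr = 1.5,
                              Upper = TRUE, Lower = TRUE, PlotTF = TRUE)
LoOLInds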

Conclusion

For detecting outliers in dependent variables, we can always use the Figure 7 algorithm.  Of course the devil is in the details.  We will cover them later.