Outliers in Dependent Variable
Let’s consider a simple case of a time series with a linear trend and some data points that “don’t belong” (Figure 1).
If we build a boxplot for this time series, we will not see
any outliers:
Figure 2: Boxplot of the time series in Figure 1
This becomes obvious if we look at the two together:
Figure 3: The two plots together
The outliers are masked by the trend: the TimeSeries in this
case is a dependent variable, and in order to identify the outliers, we need to
de-trend the data.
We have two ways to do that:
Take the differences (diff ()):
This is a valid approach and will work for linear correlations like the one in Figure 1. The idea is to bring the slope of the line to zero, so that all the outliers become visible.
The trick to using it for non-linear correlations is in the
Taylor series decomposition (see, e.g., http://en.wikipedia.org/wiki/Taylor_series
for definition):
Equation 1: f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}\,(x - a)^n
or its Newton polynomial version:
Equation 2: N(x) = \sum_{j=0}^{k} [y_0, \ldots, y_j]\, n_j(x)
with the Newton basis polynomials defined as
Equation 3: n_j(x) = \prod_{i=0}^{j-1} (x - x_i), \quad n_0(x) = 1
This suggests a convenient way to flatten the line: keep taking differences until the slope (expressed as the difference at the next step) becomes “as good as zero”.
In the linear case, the R code
would be (the data file is available upon request):
# read the time series; strip.white removes stray whitespace around the fields
Data = read.csv ("C:/Sasha/RnD/RDMBook/TimeSeries_Trend_Outliers.csv", strip.white = TRUE)
# first differences of the dependent variable (TimeSeries) and of the time index (TS)
Diff1 = data.frame (TS = Data [(2:nrow (Data)), ]$TS,
                    DeltaY = diff (as.numeric (Data$TimeSeries)),
                    DeltaX = diff (as.numeric (Data$TS)))
# local slope = rise over run
Diff1$Slope1 = as.numeric (Diff1$DeltaY) / as.numeric (Diff1$DeltaX)
x11(); par (mfcol = c (1, 2))
plot (Slope1 ~ TS, data = Diff1)
boxplot (Diff1$Slope1)
and the plots produced are shown
in Figure 4.
It is obvious that the outliers we perceived by eyeballing the data in Figure 1 are included in the outliers detected after the differencing, but the data have become much noisier. This makes sense if we consider that the moving-average filter is the equivalent of integration; the opposite action, differentiation, increases the noise.
Needless to say, each further step of differentiation would add more and more noise, until we can no longer see the forest for the trees.
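As an illustration of this noise build-up, here is a minimal sketch on synthetic data (not the data file used above; the quadratic trend and the planted outlier positions are made up for the example), differencing twice until the trend is flat:
# synthetic illustration only: quadratic trend + noise + a few planted outliers
set.seed (42)
TS = 1:1000
TimeSeries = 0.05 * TS^2 + rnorm (length (TS), sd = 5)   # quadratic trend plus noise
Planted = c (100, 400, 700)                              # hypothetical outlier positions
TimeSeries [Planted] = TimeSeries [Planted] + 200        # the "points that don't belong"
Diff1 = diff (TimeSeries)      # first differences: the slope still grows linearly
Diff2 = diff (Diff1)           # second differences: the trend is flat, but the noise is amplified
x11(); par (mfcol = c (1, 3))
boxplot (TimeSeries, main = "Raw")
boxplot (Diff1, main = "First differences")
boxplot (Diff2, main = "Second differences")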
Conclusion:
Differentiation works for detecting outliers in the
dependent variable, but watch out for the
additional noise!
Brute-force approach:
Build a regression, get the residuals, and find outliers in the residuals. This approach is simple and effective. Regression in R is very efficient, especially if we are dealing with regular least-squares regression, and in our case we have a linear correlation between the dependent variable (TimeSeries) and the independent variable (TS).
R code:
LinRegr = lm (TimeSeries ~ TS, data = Data)   # ordinary least-squares fit
LinRegr
Model = predict (LinRegr)                     # fitted values on the regression line
x11(); par (mfcol = c (2, 1))
plot (TimeSeries ~ TS, data = Data, ylab = "TimeSeries")
lines (as.numeric (Model) ~ Data$TS, lwd = 4, pch = 5, col = "red")
plot (as.numeric (LinRegr$resid) ~ Data$TS, ylab = "LinRegr$resid", xlab = "TS")
This code produces the plot in Figure 5 (the red dotted arrows were added later by hand).
Figure 5: Brute-force approach
This has the potential to detect all the points “that don’t belong” as outliers, with no false positives. Let’s verify that using the now-familiar IQR-based approach.
Recall that earlier we had a number
of functions defined in the file foo.r, one of them being GetOutlierIndices
(…). The full function definition looks
like this:
GetOutlierIndices = function (X, Upper = TRUE, Lower = TRUE,
                              RangeMplr = 1.5, PlotTF = FALSE) {
  if (PlotTF) boxplot (X)
  smry = summary (X)
  iqr = IQR (X)
  upper = smry [[5]] + RangeMplr * iqr   # 3rd quartile + RangeMplr * IQR
  lower = smry [[2]] - RangeMplr * iqr   # 1st quartile - RangeMplr * IQR
  UpperOLindices = which (Upper & X > upper)
  LowerOLindices = which (Lower & X < lower)
  Outliers = c (LowerOLindices, UpperOLindices)
  Outliers
}
Now all we need to do is put a reference to the foo.r file
into the code and use the function:
source ("foo.r")
x11(); par (mfcol = c (1, 1))   # prepare room for the new plot
OLInds = GetOutlierIndices (X = LinRegr$resid, RangeMplr = 1.5,
                            Upper = TRUE, Lower = TRUE, PlotTF = TRUE)
> OLInds
589 643 724 845 988 997
The code snippet that follows only serves the purpose of
showing the plot in Figure 6:
Residuals = data.frame (Deltas = LinRegr$resid, TS = Data$TS)
Residuals [-OLInds, ]$Deltas = 0            # zero out everything except the detected outliers
x11(); par (mfcol = c (1, 1))               # prepare room for the new plot
plot (as.numeric (LinRegr$resid) ~ Data$TS, ylab = "LinRegr$resid", xlab = "TS")
points (Deltas ~ TS, data = Residuals, col = "red", pch = 5, cex = 0.8)
legend (x = "topleft", pch = c (1, 5), col = c ("black", "red"),
        legend = c ("Residuals", "Outliers"))
We found all 6 outliers, and their indices correspond to the indices of the original “data points that don’t belong”. By adjusting RangeMplr in the GetOutlierIndices () call, we can further fine-tune the detection (for example, exclude points 589 and 997) without making the data any noisier than they already are.
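As a sketch of such tuning (the multiplier 3 below is just an illustrative guess; the exact value needed to drop points 589 and 997 depends on the residuals):
# a larger RangeMplr keeps only the more extreme residuals as outliers
OLIndsStrict = GetOutlierIndices (X = LinRegr$resid, RangeMplr = 3,
                                  Upper = TRUE, Lower = TRUE, PlotTF = FALSE)
OLIndsStrict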
Summary
To summarize, detection of outliers can be described in three steps (Figure 7):
1. Fit a curve to the data.
2. Compute the residuals (errors).
3. Find outliers in the residuals; these are the outliers we are looking for.
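Put together, a minimal sketch of these three steps for the linear case, reusing the Data frame and the GetOutlierIndices () helper defined above:
source ("foo.r")                                            # for GetOutlierIndices ()
LinRegr = lm (TimeSeries ~ TS, data = Data)                 # step 1: fit a curve
Resid = as.numeric (LinRegr$resid)                          # step 2: compute the residuals
OLInds = GetOutlierIndices (X = Resid, RangeMplr = 1.5,
                            Upper = TRUE, Lower = TRUE)     # step 3: find outliers in the residuals
Data [OLInds, ]                                             # the original "points that don't belong"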
Outliers with “Dangerous Curves”
The linear dependency case is merely a very simple demonstration of the approach outlined in Figure 7; there is no reason, however, not to use it for nonlinear regressions. When we have a nonlinear relationship, we need to consider the following questions:
1. Can we use local regression with the data?
· LOESS (LOcal regrESSion) and its cousin LOWESS (LOcally WEighted Scatterplot Smoothing) will very often be the tools of choice in this kind of analysis if we are only interested in finding outliers regardless of what they mean (a sketch appears after this list).
i. For functions in R that fit loess or lowess curves to your data, type “?lowess” and “?loess” in the R GUI for details.
· Splines, their multivariate counterpart MARS (Multivariate Adaptive Regression Splines), and other piecewise regression methods.
i. There are some very nice blogs and lectures online describing the use of splines and GAM (Generalized Additive Model) in R, like this one: http://statisticsr.blogspot.com/2008/10/notes-for-polynomial-regression-and.html
ii. Or this one: http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_16.pdf
· One thing to look out for is overfitting: if the curve goes through every data point (in the extreme case), then we will not be able to find outliers in the residuals.
2. Do we know the physics behind the data (by “physics” I mean the general knowledge of why the data behave in this manner)?
3. Is time a predictor variable?
· If it is, then we can use the Exponentially Weighted Moving Average (EWMA) and / or Autoregressive Integrated Moving Average (ARIMA) techniques to find the best-fitting curves. For more details on the fine points related to the use of forecasting and regression in Time-Series Analysis (TSA), go to http://alexonsimanddata.blogspot.com/2011/12/time-series-forecasting-regression-and.html. More in the Proceedings of the CMG06 and CMG12 Conferences (www.cmg.org).
4. What are the possible “linearizable” curves that can describe this relationship?
· There are 4 basic “linearizable” curves; the linearizations can be derived by simple algebraic substitution (a sketch after this list illustrates one of them). How to select the best-fitting regression is a fascinating subject, but it is too big for this post. Stay tuned!
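To illustrate the local-regression route from question 1, here is a minimal sketch on synthetic data (the series, the smoothing span f = 0.2, and the planted outlier positions are all made up for the example), fitting a lowess curve and then looking for outliers in the residuals with the same GetOutlierIndices () helper:
# synthetic nonlinear series with a few planted outliers (illustration only)
set.seed (123)
x = seq (0, 10, length.out = 500)
y = sin (x) + rnorm (length (x), sd = 0.1)
y [c (50, 250, 400)] = y [c (50, 250, 400)] + 1      # planted outliers
Smooth = lowess (x, y, f = 0.2)                      # f controls the amount of smoothing
Resid = y - Smooth$y                                 # residuals around the lowess curve (x is already sorted)
source ("foo.r")                                     # for GetOutlierIndices ()
GetOutlierIndices (X = Resid, RangeMplr = 1.5, Upper = TRUE, Lower = TRUE, PlotTF = TRUE)
And for question 4, one example of a “linearizable” curve (assuming, purely for illustration, an exponential relationship y = a * exp (b * x)): taking logarithms turns it into a straight line, so the same linear-regression machinery applies:
# y = a * exp (b * x)  =>  log (y) = log (a) + b * x, which is linear in x
a = 2; b = 0.3
yExp = a * exp (b * x) * exp (rnorm (length (x), sd = 0.05))   # multiplicative noise; reuses x from the sketch above
LinFit = lm (log (yExp) ~ x)                                   # fit on the linearized scale
GetOutlierIndices (X = as.numeric (LinFit$resid), RangeMplr = 1.5,
                   Upper = TRUE, Lower = TRUE, PlotTF = TRUE)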
Conclusion
For detecting outliers in dependent variables, we can always use the Figure 7 algorithm. Of course, the devil is in the details; we will cover them later.