There is a common misconception that outliers are always a bad thing, and once we get rid of them, everything will fall into place. This is the first post in a series on the role of outliers in data, as well as ways to detect and quantify them, correct for them, and use them as important indicators of change.
Due to the exponential growth in the number of new packages for R, I will focus this series on base R and some (hopefully not too many) packages designed to make the analyst's life simpler, as opposed to confusing it even more than it already is. Code snippets are provided, along with real-life data (available upon request) and a discussion of results.
I will also cover cluster analysis and how this technique can be used to detect outliers, but that's in later posts in this series.
Every now and then, you will see explanations of how R code works; if you know this stuff already, just skip over such places.
There will be several installments in this series, so stay tuned.
___________________________________________________________________________
Size matters: how many data points do we need?
Consider an example of a time series. When we model a time series, we want our model to be “correct” in the sense that at each moment in time, the model output is “not too far off” from the actual measured data value. As an example, take historical data for storage disk usage (in terabytes). Figure 1 shows disk usage with a periodic component and a lot of variability in the data. Figure 2 shows disk usage with very little, if any, variance in the data, but a significant trend.
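The actual disk-usage data behind the figures are, as mentioned above, available upon request. For readers who want to play along right away, here is a base-R sketch that simulates two stand-in series with the same flavour as Figures 1 and 2; all numbers, including the series lengths, are made up for illustration.

# Simulated stand-ins for the two disk-usage series (made-up numbers)
set.seed(42)
months <- 1:36                                   # arbitrary length for illustration

# Figure 1 flavour: a periodic (seasonal) component plus a lot of noise
usage.noisy <- 50 + 10 * sin(2 * pi * months / 12) + rnorm(length(months), sd = 8)

# Figure 2 flavour: a clear linear trend with very little noise
usage.trend <- 20 + 1.5 * months + rnorm(length(months), sd = 0.3)

par(mfrow = c(1, 2))
plot(months, usage.noisy, type = "b", xlab = "Month", ylab = "Disk usage, TB",
     main = "Periodic, high variability")
plot(months, usage.trend, type = "b", xlab = "Month", ylab = "Disk usage, TB",
     main = "Trend, low variability")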
Figure 1 illustrates that when the sample is insufficient, we cannot be confident in making any call regarding the data: trends, outliers, level shifts, periodicity, etc. are all hidden in the non-represented data. When we look at the 12 data points, we do not yet know whether the data are seasonal, trended, or simply contain a lot of random irregularities. A bigger sample is needed in this case.
Figure 2 is a different time series. The data variation is significantly smaller here, and even 12 data points would suffice to tell that the data are linear and not seasonal.
In other words, there is no single number such that, as long as the sample size is greater than this number, we can be confident that what we say about the data is true. However, what we can do is use the apparatus of statistical hypothesis testing. The fuzzy “not too far off” then becomes an error tolerance, and we can use a method derived from Student's t-test (well described in the literature and, e.g., at http://www.itl.nist.gov/div898/handbook/eda/section3/eda352.htm) to evaluate the sample size needed to guarantee that the desired percentage of data values (i.e., the confidence level) falls within that error tolerance. The math of this evaluation is well described in the statistics literature.
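As a rough sketch of the idea (not necessarily the exact calculation behind the numbers quoted below), the iterative t-based sample-size formula can be written in a few lines of base R. The function name, the standard deviation s, and the tolerance delta are placeholders of my own choosing.

# Minimal sketch: sample size N such that the estimate stays within error
# tolerance `delta` with confidence `conf`, using the t-based recipe
# N = (t * s / delta)^2, iterated because t depends on N through its df.
sample.size.needed <- function(s, delta, conf = 0.98, max.iter = 100) {
  alpha <- 1 - conf
  # initial guess from the normal approximation
  n <- max(2, ceiling((qnorm(1 - alpha / 2) * s / delta)^2))
  for (i in seq_len(max.iter)) {
    n.new <- ceiling((qt(1 - alpha / 2, df = n - 1) * s / delta)^2)
    if (n.new == n) break      # fixed point reached
    n <- n.new
  }
  n
}

# Hypothetical inputs: sample standard deviation of 2 TB, tolerance of 0.5 TB
sample.size.needed(s = 2, delta = 0.5, conf = 0.98)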
For instance, in the example shown in Figure 2, the sample is sufficient (even 4 data points would suffice to be 98% confident that the model error will not be “too far off”). By contrast, in the example of Figure 1, judging by the sample shown (only 12 data values), 130 points would be needed to be 98% confident that the model error will be within the tolerance.
Naturally, reducing the desired confidence level and increasing the error tolerance that the user can live with improve the odds that a given sample is sufficient. For example, only 92 data points would suffice in the example of Figure 1 if we can accept that only 90% of all data points fall within the desired tolerance of the measured value at each time stamp.
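Reusing the hypothetical sample.size.needed() sketch from above (with made-up inputs, not the actual disk-usage figures), the trade-off looks like this:

# Relaxing the confidence level and/or widening the tolerance shrinks N
sample.size.needed(s = 2, delta = 0.5, conf = 0.98)  # stricter requirement
sample.size.needed(s = 2, delta = 0.5, conf = 0.90)  # fewer points needed
sample.size.needed(s = 2, delta = 1.0, conf = 0.90)  # fewer still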
Conclusion: size matters, and frustratingly enough, we cannot, based on a small sample, decide that the sun now sets in the East and rises in the West. (This also suggests an important conclusion about giving people second chances: a rough start does not necessarily mean a rough ride, and vice versa.)
We run into a circular-logic paradox: we need enough data to be able to tell how the data behave, but the answer to the "how much is enough?" question depends on that very behavior.
One possible solution is to use a Bayesian approach: start with a prior hypothesis and stay with it until proven wrong.
(To be continued...)