Thursday, November 22, 2012

R and Outliers. Part 1: When Enough is not Enough

(Apologies for accidentally deleting this post; restoring)


There is a common misconception that outliers are always a bad thing, and once we get rid of them, everything will fall into place. This is the first post in a series on the role of outliers in data, as well as ways to detect and quantify them, correct for them, and use them as important indicators of change. 

Due to the exponential growth of the number of new packages for R, I will focus this series on base R and some (hopefully not too many) packages designed to make the analyst's life simpler, rather than more confusing than it already is. Code snippets are provided, along with real-life data (available upon request) and a discussion of results.

I will also cover cluster analysis and how this technique can be used to detect outliers, but that's in later posts in this series.

Every now and then, you will see explanations of how R code works; if you know this stuff already, just skip over such places.


There will be several installments in this series, so stay tuned. 

___________________________________________________________________________

Size matters: how many data points do we need?

Consider an example of time series.
When we model a time series, we want our model to be “correct” in the sense that at each moment in time, we want the model output to be “not too far off” from the actual measured data value. Example – historical data for a storage disk usage (in Terabytes). Figure 1 shows disk usage with periodic a periodic component and a lot of variability in data. Figure 2 shows disk usage with very little, if any, variance in data, but a significant trend.



Figure 1 illustrates that when the sample is insufficient, we cannot confidently make any call regarding the data: trends, outliers, level shifts, periodicity, etc. are all hidden in the as-yet-unobserved data. Looking at just 12 data points, we do not yet know whether the data are seasonal, trended, or simply contain a lot of random irregularities. A bigger sample is needed in this case.

 

Figure 2 is a different time series. The data variation is significantly smaller here, and even 12 data points suffice to tell that the data follow a linear trend and are not seasonal.

In other words, there is no single number such that, as long as the sample size is greater than this number, we can be confident that what we say about the data is true. However, what we can do is use the apparatus of statistical hypothesis testing.

The fuzzy “not too far off” can then be converted into an error tolerance, and we can use a method derived from Student’s t-test (well described in the statistical literature and in, e.g., http://www.itl.nist.gov/div898/handbook/eda/section3/eda352.htm) to evaluate the sample size needed to guarantee, at the desired confidence level, that the model error will fall within that tolerance.
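As a sketch of that calculation, here is a minimal implementation of the iterative t-based sample-size formula from the NIST handbook page linked above; the standard deviation s and tolerance E values are illustrative, not taken from the article's data:

```r
# Smallest n satisfying n >= (t_{1-alpha/2, n-1} * s / E)^2:
# enough points for the estimate to be within tolerance E at
# confidence level conf, given a standard-deviation estimate s.
sample_size <- function(s, E, conf = 0.98) {
  alpha <- 1 - conf
  n <- 2                            # the t-quantile needs at least 1 df
  while (n < (qt(1 - alpha / 2, df = n - 1) * s / E)^2) {
    n <- n + 1
  }
  n
}

sample_size(s = 0.5, E = 1)   # low-variance series: a handful of points
sample_size(s = 5,   E = 1)   # high-variance series: far more points needed
```

Lower variability, a wider tolerance, or a lower confidence level all drive the required n down; this is the mechanism behind the 4-point versus 130-point contrast discussed below.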

For instance, in the example shown in Figure 2, the sample is sufficient (even 4 data points would suffice to be 98% confident that the error of the model will not be “too far off”). By contrast, in the example of Figure 1, judging by the sample shown – only 12 data values – 130 points would be needed to be 98% confident that the error of the model will be within the tolerance.

Naturally, reducing the desired confidence level and increasing the error tolerance the user can live with improve the odds that a given sample is sufficient. For example, only 92 data points suffice in the example of Figure 1 if we can accept that only 90% of all data points fall within the desired tolerance of the measured value at each time stamp.
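To see the “percentage of points within tolerance” idea in action, here is a small simulation; the series length, noise level, and the 2 TB tolerance are all invented for illustration. We fit a linear model to a trended series and measure the fraction of points that land within the tolerance of the fitted values.

```r
set.seed(1)
month <- 1:92
usage <- 10 + 0.5 * month + rnorm(92, sd = 1)   # trended series with noise
fit   <- lm(usage ~ month)
tol   <- 2                                      # error tolerance, in TB
within <- abs(residuals(fit)) <= tol
mean(within)    # fraction of points whose model error is within tolerance
```

If this fraction falls short of the desired confidence level, either more data or a looser tolerance is needed.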
Conclusion:
Size matters, and, frustratingly enough, we cannot decide, based on a small sample, that the sun now sets in the East and rises in the West. (This also suggests an important conclusion about giving people second chances: a rough start does not necessarily mean a rough ride, and vice versa.)

We run into a circular-logic paradox: we need enough data to be able to tell how the data behave, but the answer to the "how much is enough?" question depends on the data behavior.

One possible solution is to use a Bayesian approach: start with a prior hypothesis and stay with it until proven wrong.

(To be continued...)
