Thursday, June 17, 2010

Poisson, the moon, seasonal changes, and forecasting

A very special case of a non-Normal distribution is the Poisson distribution.  I call it special because it can be used in analyzing data for seasonality.  (Any periodic pattern is by convention called "seasonal".  The reasons why are hidden behind a shroud of mystery; I honestly tried to understand them, but failed, so I just joined the crowd that says that seasonal patterns don't have to follow the annual cycle, and left it at that.)

In the ideal world, if there is a 7-day (or a 3-month, or a 4-week) pattern in the data, we can count on this pattern repeating consistently.  Moon phases (and consequently the tidal movements of water) would be a good example of that.  For cases like this, Fourier analysis works wonders, allowing us to analyze the data, pick out all the precise seasonalities, sort them in order of magnitude (amplitude), and keep the most significant ones (more on significance later in this post).
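
A minimal sketch of that idea in Python with numpy; the daily series here is synthetic, made up just so that it contains a 7-day and a 30-day cycle:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 365
    t = np.arange(n)
    # Synthetic daily series with a 7-day and a 30-day cycle plus noise.
    y = 3 * np.sin(2 * np.pi * t / 7) + 1.5 * np.sin(2 * np.pi * t / 30) + rng.normal(0, 1, n)

    spectrum = np.abs(np.fft.rfft(y - y.mean()))
    freqs = np.fft.rfftfreq(n, d=1.0)            # in cycles per day

    # Sort the periods by amplitude and keep the strongest ones.
    for k in np.argsort(spectrum)[::-1][:3]:
        if freqs[k] > 0:
            print(f"period ~ {1 / freqs[k]:.1f} days, amplitude {spectrum[k]:.1f}")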

But unless we are in a business related to open-sea fishery, offshore oil drilling, lighthouse operations, or any other where lunar phases are important, most of the time we do not have such a clear-cut seasonal pattern: a maintenance event may be off by a day; periodic deliveries may not always happen on time; task processing may take longer than it usually does; etc.

In these cases, we see outliers (events) following an approximately periodic pattern, with clusters of data between the outliers.  We can view this problem from two sides:

1.   We can analyze the intervals (the sizes of these clusters) and, if they follow an Erlang distribution, we may be able to predict the amount of traffic based on queueing theory (see the first sketch after this list).  The Erlang distribution, however, is not very simple to use in data analysis (albeit very useful), as the correct formula depends on what is done to the blocked calls (packets): they may be lost (Erlang B) or queued (Erlang C).  For more on the Erlang distribution, see http://en.wikipedia.org/wiki/Erlang_distribution

In addition, not all situations can be reduced to a queueing theory application, and sometimes a more generic approach is needed.

2.   We can look at the problem from the other end and sort the cluster sizes, with the assumption that some fuzziness is expected: e.g., cluster sizes of 5-8 days count as weekly, whereas 26-35 count as monthly, etc.  That done, we can count the outliers (events) that occurred during these intervals, and if the most 'prominent' period (the mode of the cluster-size distribution) has event counts following the Poisson distribution, then we have a seasonality (see the second sketch after this list).  This seasonality can then be used as a parameter in forecasting the data.
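
For the first approach, one way is to fit a gamma distribution to the intervals (an Erlang distribution is just a gamma with an integer shape) and test the fit.  A minimal sketch, assuming Python with scipy; the intervals are made up:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    intervals = rng.gamma(shape=3, scale=2.0, size=200)   # stand-in for real cluster sizes

    # Erlang(k, lambda) is a gamma distribution with integer shape k.
    shape, loc, scale = stats.gamma.fit(intervals, floc=0)
    k = max(1, round(shape))

    # Goodness of fit of the Erlang candidate (parameters were estimated
    # from the same data, so the p-value is only indicative).
    stat, p = stats.kstest(intervals, "gamma", args=(k, 0, scale))
    print(f"k = {k}, rate = {1 / scale:.3f}, KS p-value = {p:.3f}")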
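
For the second approach, the classic check is the dispersion test: under a Poisson distribution, the variance of the counts should match their mean.  A minimal sketch with made-up weekly counts:

    import numpy as np
    from scipy import stats

    counts = np.array([4, 6, 5, 3, 7, 5, 4, 6, 5, 8, 4, 5])   # events per 'weekly' cluster

    n = len(counts)
    # Under Poisson, (n - 1) * variance / mean follows Chi-Square with n - 1 DF.
    d = (n - 1) * counts.var(ddof=1) / counts.mean()
    p = 2 * min(stats.chi2.cdf(d, n - 1), stats.chi2.sf(d, n - 1))
    print(f"dispersion = {d:.2f}, two-sided p = {p:.3f}")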

Wednesday, June 2, 2010

"Abnormal" distributions: just a reminder

The Central Limit Theorem (CLT) is very well known: no matter what the distribution of the population, the means of random samples taken from it will be approximately Normally distributed, provided the sample size is "big enough".  A lot has been done in statistics to describe the Normal distribution and to use it: most notably, the Z and T tests, the F test, and the entire ANOVA; even the binomial distribution of discrete outcomes (success/failure) of the Bernoulli-trial variety has been shown to be well approximated by the Normal distribution, which allows us to use the Z (T) tests to judge improvement or degradation in the reliability presented as Defects per Million Opportunities, aka DPMO.
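
A quick way to see the CLT at work (a minimal sketch in Python; the population and all the sizes are arbitrary):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    population = rng.exponential(scale=1.0, size=100_000)    # heavily skewed population
    means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

    print(f"skewness of the population:   {stats.skew(population):.2f}")   # ~2 for the Exponential
    print(f"skewness of 50-sample means:  {stats.skew(means):.2f}")        # ~2/sqrt(50), close to 0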

But is the Normal distribution really that important?  What if we are sampling data from a process with a random component, and we know that there is no way that we can guarantee the randomness of the samples? 

It is great when we are running a Hadoop-type application, with millions of data rows coming in.  We can sample all we want in any way, shape, or form, and force the distribution to be Normal by fulfilling the CLT conditions.

But what if we can only measure once a day?  Then the condition of the CLT no longer holds.  Does it mean that our monthly distribution of samples cannot be judged as Normal?  What can we do to use statistical analysis on such data?

One way around the problem is to group the data into clusters and then to take random data points out of the clusters.  The formality is preserved, but the quality of the data suffers.  If we have 28 data points (one for each day) and we cluster them into weekly groups, we end up with just 4 samples, and it hardly matters that the means of these 4 samples follow the Normal distribution.  To make things even more desperate, who is to say that these 4 samples will correspond to the weekly data variations?  What if we are trying to track the weekend variations, but we start the 28-day group on a Wednesday?  We have just given away one (out of only four!) seven-day sample by splitting it into Wednesday-Saturday and Sunday-Tuesday periods.
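
To put numbers on it (a minimal sketch; the 28 daily readings are made up):

    import numpy as np

    rng = np.random.default_rng(3)
    daily = rng.normal(100, 10, size=28)             # one reading per day, 4 weeks

    weekly_means = daily.reshape(4, 7).mean(axis=1)  # 28 points collapse into 4 means
    print(f"{daily.size} daily points -> {weekly_means.size} weekly means:")
    print(np.round(weekly_means, 1))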

To make the long story short, we cannot always have a Normal distribution.  Then how do we make the judgment calls on process improvement, on capacity sufficiency, on difference in processes?

Parenthetically, the funny thing about the T-test is that Student's T is NOT Normal, but at large sample sizes (high numbers of degrees of freedom) T can be approximated by the Z distribution, which is Normal.  Yet it is the T-test, NOT a Z-test, that is to be used when comparing two means or comparing a mean with a value.  Then why are we so worried about Normality when checking the equality of means?  Because at DF > 30 (DF == degrees of freedom) we have to use the Z-table.  So we say that the assumption is that the samples are drawn from a Normal distribution.
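
The convergence is easy to see by comparing critical values (a minimal sketch using scipy):

    from scipy import stats

    z = stats.norm.ppf(0.975)                  # two-sided 5% critical value of Z
    for df in (5, 10, 30, 100, 1000):
        t = stats.t.ppf(0.975, df)
        print(f"DF = {df:4d}: t = {t:.3f} vs z = {z:.3f}")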

Sadly, we rarely check that this assumption is true, especially when performing time-series analyses, and we rarely Normalize the variable we are looking at.  For example, if we are dealing with Rho - Pearson's correlation coefficient - we can use Fisher's transformation to make it approximately Normal and then use the Normalized values to decide how well the model fits the data (one-sample T-test) or how different two models are from each other, and which one is better (two-sample T-test).
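
A minimal sketch of the two-model comparison via Fisher's transformation; the correlations and sample sizes are made up, and the comparison uses the Normal approximation with standard error 1/sqrt(n - 3):

    import numpy as np
    from scipy import stats

    r1, n1 = 0.62, 100    # correlation of model 1 with the data
    r2, n2 = 0.48, 120    # correlation of model 2 with the data

    z1, z2 = np.arctanh(r1), np.arctanh(r2)      # Fisher's transformation: z = atanh(r)
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))    # standard error of z1 - z2
    z = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z))
    print(f"z = {z:.2f}, two-sided p = {p:.3f}")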

For the "typical" analyses, simpler Normalizing transformations (power transformations, roots, etc.) can be used.
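
For example, the Box-Cox family of power transformations (a minimal sketch; the skewed data here are synthetic):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    skewed = rng.lognormal(mean=0, sigma=1, size=500)   # strictly positive, skewed data

    transformed, lam = stats.boxcox(skewed)             # fits the power lambda by maximum likelihood
    print(f"fitted lambda = {lam:.2f}")
    print(f"skewness before: {stats.skew(skewed):.2f}, after: {stats.skew(transformed):.2f}")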

Special care must be exercised when going from values to their derivatives (and vice versa): in the general case, Normality does not hold for such transformations.  Most importantly, a mixture of two Normal distributions is NOT Normal unless their means (and variances) are equal, and the product of two Normal random variables is generally not Normal either (the sum of two independent Normal variables, on the other hand, is Normal).
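
A quick simulation makes the distinction concrete (a minimal sketch; all the parameters are arbitrary):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    a = rng.normal(0, 1, 5000)
    b = rng.normal(5, 1, 5000)

    mixture = np.concatenate([a, b])     # a 50/50 mixture of the two: bimodal, non-Normal
    total = a + b                        # sum of independent Normals: Normal again

    print(f"mixture p-value: {stats.normaltest(mixture).pvalue:.3g}")  # ~0: Normality rejected
    print(f"sum p-value:     {stats.normaltest(total).pvalue:.3g}")    # typically large: not rejected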

One special distribution is the Exponential distribution.  It is heavily used in queueing theory, because it naturally describes Poisson processes and Markov chains, and in reliability analysis: the time between failures is Exponentially distributed (MTBF, the mean time between failures, is its mean), and the Reliability function is defined using the MTBF, so that we can use the Chi-Square distribution to evaluate an improvement or degradation in reliability if we know the change in MTBF.
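
A minimal sketch of that Chi-Square evaluation, assuming a failure-terminated test with the standard interval 2T/chi2(alpha/2, 2r) <= MTBF <= 2T/chi2(1 - alpha/2, 2r); the test time and failure count are made up:

    from scipy import stats

    total_time = 5000.0    # total accumulated operating hours, T
    failures = 8           # observed failures, r

    mtbf = total_time / failures
    lower = 2 * total_time / stats.chi2.ppf(0.975, 2 * failures)   # pessimistic bound
    upper = 2 * total_time / stats.chi2.ppf(0.025, 2 * failures)   # optimistic bound
    print(f"MTBF = {mtbf:.0f} h, 95% CI = ({lower:.0f}, {upper:.0f}) h")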

As a rule, do NOT Normalize Exponentially-distributed data: this distribution is too important for that!

For a great outline of Normal and non-Normal distributions and tests, follow this link:
http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf