The Central Limit Theorem (CLT) is very well known: no matter what the distribution of the population, the means of random samples drawn from it will be approximately Normally distributed, provided the sample size is "big enough". A lot has been done in statistics to describe the Normal distribution and to use it, most notably the Z and T tests, the F test, and the entire ANOVA; even the Binomial distribution of discrete outcomes (success/failure) of the Bernoulli-trial variety is well approximated by the Normal distribution for large samples, which allows us to use the Z (T) tests to judge improvement or degradation in reliability presented as Defects per Million Opportunities, aka DPMO.
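Here is a minimal sketch of the CLT in action (my own illustration, with made-up parameters): draw repeated samples from a clearly non-Normal (Exponential) population and compare the skewness of the raw data with the skewness of the sample means. The means come out far closer to the symmetric, Normal shape that the CLT promises.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

population = stats.expon(scale=10)   # a heavily right-skewed population
n = 50                               # size of each sample ("big enough")
n_samples = 2000                     # number of repeated samples

raw = population.rvs(size=n_samples, random_state=rng)
sample_means = population.rvs(size=(n_samples, n), random_state=rng).mean(axis=1)

print(f"skewness of raw data:     {stats.skew(raw):.2f}")          # ~2 for the Exponential
print(f"skewness of sample means: {stats.skew(sample_means):.2f}") # ~2/sqrt(n), about 0.28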
But is the Normal distribution really that important? What if we are sampling data from a process with a random component, and we know that there is no way that we can guarantee the randomness of the samples?
It is great when we are running a Hadoop-type application with millions of data rows coming in: we can sample all we want, in any way, shape, or form, and make the distribution of the sample means Normal by fulfilling the CLT conditions.
But what if we can only measure once a day? Then the conditions of the CLT no longer hold. Does that mean our monthly collection of samples cannot be judged as Normal? What can we do to apply statistical analysis to such data?
One way around the problem is to group the data into clusters and then take random data out of the clusters. The formality is preserved, but the quality of the data suffers. If we have 28 data points (1 for each day) and we cluster them into weekly groups, we end up with only 4 samples, and it hardly matters that the means of these 4 samples follow the Normal distribution. To make things even more desperate, who is to say that these 4 samples will correspond to the weekly data variations? What if we are trying to track the weekend variations, but we start the 28-day group on a Wednesday? We have just given away one (out of only 4!) seven-day sample by splitting the calendar week into Wednesday-Saturday and Sunday-Tuesday pieces.
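Purely to illustrate the bookkeeping, here is a small sketch with made-up daily numbers: 28 daily measurements grouped into 4 weekly clusters, from which we take either one random observation per cluster or the cluster means. Either way, we are down to 4 points, whatever structure (weekends, etc.) the raw series had.

import numpy as np

rng = np.random.default_rng(0)

daily = rng.normal(loc=100, scale=15, size=28)   # hypothetical one-per-day measurements
weeks = daily.reshape(4, 7)                      # 4 clusters of 7 consecutive days

one_per_cluster = np.array([rng.choice(week) for week in weeks])
weekly_means = weeks.mean(axis=1)

print("random pick per cluster:", np.round(one_per_cluster, 1))
print("weekly means:           ", np.round(weekly_means, 1))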
To make a long story short, we cannot always have a Normal distribution. How, then, do we make the judgment calls on process improvement, on capacity sufficiency, on differences between processes?
Parenthetically, the funny thing about the T-test is that Student's T is NOT Normal; only at large sample sizes (high numbers of degrees of freedom) can T be approximated by the Z distribution, which is Normal. Yet it is the T-test, NOT a Z-test, that is to be used when comparing two means or comparing a mean with a value. Then why are we so worried about Normality when checking the equality of means? Because at DF > 30 (DF == degrees of freedom) the T and Z critical values are close enough that we traditionally switch to the Z-table. So we say that the assumption is that the samples are drawn from a Normal distribution.
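A quick numeric check of that "DF > 30" rule of thumb (my own illustration): compare the two-sided 95% critical values of Student's T with the Normal (Z) value of 1.96 as the degrees of freedom grow.

from scipy import stats

z_crit = stats.norm.ppf(0.975)
for df in (5, 10, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df)
    print(f"df={df:5d}: t={t_crit:.3f}  vs  z={z_crit:.3f}")

# t at df=5 is about 2.571; by df=30 it is about 2.042, already close to 1.960,
# which is why the Z-table is commonly used beyond roughly 30 degrees of freedom.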
Sadly, we rarely check that this assumption is true, especially when performing time-series analyses, and we rarely Normalize the variable we are looking at. For example, if we are dealing with Rho - Pearson's correlation coefficient - we can use Fisher's transformation to make it approximately Normal and then use the transformed values to decide how well the model fits the data (one-sample T-test) or how different two models are from each other, and which one is better (two-sample T-test).
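A minimal sketch of Fisher's transformation, with hypothetical r values and sample sizes (the textbook version of this test uses the Normal (Z) reference rather than T, which is what the sketch does): z = arctanh(r) is approximately Normal with standard error 1/sqrt(n - 3), so ordinary one-sample and two-sample comparisons apply.

import numpy as np
from scipy import stats

def fisher_z(r):
    return np.arctanh(r)

# One-sample comparison: does a fitted r = 0.82 (n = 40) differ from a reference rho0 = 0.70?
r_a, n_a, rho0 = 0.82, 40, 0.70
z_stat = (fisher_z(r_a) - fisher_z(rho0)) * np.sqrt(n_a - 3)
print("one-sample p-value:", 2 * stats.norm.sf(abs(z_stat)))

# Two-sample comparison: model A (r = 0.82, n = 40) vs model B (r = 0.65, n = 55)
r_b, n_b = 0.65, 55
se = np.sqrt(1 / (n_a - 3) + 1 / (n_b - 3))
z_stat2 = (fisher_z(r_a) - fisher_z(r_b)) / se
print("two-sample p-value:", 2 * stats.norm.sf(abs(z_stat2)))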
For the "typical" analyses, simpler Normalizing transformations (power transformations, roots, etc.) can be used.
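As one example of such a power transformation (my own, on made-up skewed data), a Box-Cox transform with the exponent chosen by maximum likelihood:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
skewed = rng.lognormal(mean=3.0, sigma=0.8, size=500)   # hypothetical right-skewed data

transformed, lam = stats.boxcox(skewed)                 # scipy picks lambda by maximum likelihood
print(f"estimated lambda: {lam:.2f}")
print(f"skewness before:  {stats.skew(skewed):.2f}")
print(f"skewness after:   {stats.skew(transformed):.2f}")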
Special care must be exercised when going from values to their derivatives (and vice versa): in the general case, Normality does not survive such transformations. Most importantly, data pooled (mixed) from two Normal populations is NOT Normal unless the populations coincide in mean (and spread), and the product of two Normal variables is not Normal even then.
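A simulation sketch of that warning (my own, with made-up parameters): data pooled from two Normal populations with different means, and the product of two Normal variables, both fail a Normality test badly; the plain sum of two independent Normals is exactly Normal by construction and serves as a control.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
size = 5000

a = rng.normal(0, 1, size)
b = rng.normal(4, 1, size)           # same spread as a, different mean

mixture = np.concatenate([a, b])     # data pooled from the two populations
product = a * b                      # product of two Normal variables
plain_sum = a + b                    # sum of independent Normals: Normal by construction

for name, data in [("mixture", mixture), ("product", product), ("sum", plain_sum)]:
    _, p = stats.normaltest(data)
    print(f"{name:8s}: Normality test p-value = {p:.3g}")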
One special distribution is the Exponential distribution. It is heavily used in queueing theory, thanks to how naturally it describes Poisson processes and Markov chains, and in reliability analysis, because times between failures are Exponentially distributed with mean MTBF (mean time between failures), and the Reliability function is defined through the MTBF as R(t) = exp(-t/MTBF). This lets us use the Chi-Square distribution to evaluate an improvement or degradation in reliability, if we know the change in MTBF.
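A sketch of that Chi-Square machinery with hypothetical numbers, assuming a failure-truncated test: with r failures observed over a total accumulated time T, the quantity 2T/MTBF follows a Chi-Square distribution with 2r degrees of freedom, which gives a confidence interval for the MTBF and hence for R(t).

import numpy as np
from scipy import stats

T = 10_000.0    # total accumulated operating hours (hypothetical)
r = 8           # number of failures observed (hypothetical)
alpha = 0.10    # for a 90% two-sided interval

mtbf_hat = T / r
mtbf_low = 2 * T / stats.chi2.ppf(1 - alpha / 2, 2 * r)
mtbf_high = 2 * T / stats.chi2.ppf(alpha / 2, 2 * r)

t_mission = 100.0                              # mission time of interest
reliability = np.exp(-t_mission / np.array([mtbf_low, mtbf_hat, mtbf_high]))

print(f"MTBF estimate: {mtbf_hat:.0f} h, 90% CI [{mtbf_low:.0f}, {mtbf_high:.0f}] h")
print(f"R({t_mission:.0f} h): low {reliability[0]:.3f}, point {reliability[1]:.3f}, high {reliability[2]:.3f}")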
As a rule, do NOT Normalize Exponentially-distributed data: this distribution is too important for that!
For a great outline of Normal and non-Normal distributions and tests, follow this link:
http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf