Sunday, July 21, 2013

From Outliers to Process Anomalies: Predictive SPC. Part 2

Statistical Process Control (SPC) is a well-described framework used to identify weak points in any process and predict the probability of failure in it.  The distribution parameters of process metrics have been translated into process capability, which evolved in the 1990s into the Six Sigma methodology in a number of incarnations.  However, all techniques derived for SPC have two important weaknesses: they assume that the process metric is in a steady state, and they assume that the process metric is normally distributed or can be converted to a normal distribution.  The concepts and ideas outlined here make it possible to overcome these two shortcomings.  This method has been developed and validated in collaboration with Josep Ferrandiz.  This is the second post in the series on Predictive SPC.



Definitions are here.

In Part 1, we covered the Axioms and Assumptions (A&As) of SPC, explained why they are wrong, but useful, and now we move on to the Anatomy of SPC Charts and some of the key SPC concepts and ideas.

Anatomy of an SPC Chart

A lot has been said in the literature and blogosphere about SPC charts; some have taken the subject to new levels, like Igor Trubin, whose Statistical Exception Detection System (SEDS) is described in his blog.  SEDS is an advanced “classic” SPC engine that is scalable enough to be used as an enterprise-level system.  (We met at the CMG’06 Conference, where I was presenting my first CMG paper, “A priori Evaluation of Data and Selection of Forecasting Model,” written in collaboration with Michael Perka, who has been my guide into the world of IT for many years now.  Since then we have been discussing Capacity Management, Mathematics, and Philosophy every time we get together, writing up ideas on napkins and posing questions that inspire new publications.)


This post describes the SPC chart, introduces some SPC rules of thumb, and discusses “classic” Statistical Process Control.  For more details, I highly recommend the book by one of the greatest applied statisticians of our time, George Box: [Box, G.E.P. and Luceno, A. (1997). Statistical Control by Monitoring and Feedback Adjustment. Wiley Series in Probability and Statistics. John Wiley and Sons, Inc. ISBN 0-471-19046-2]


Figure 1 demonstrates a typical SPC chart of a single process that is within the specifications, but out of control.



Figure 1:  A process within specifications (between LSL and UCL), but out of control.


If it is a time series (all data points are measured at constant time intervals), then all methods used in time series analysis are applicable to the process described by a chart like the one in Figure 1.


In this particular example, we see that the process is stationary (there is no trend in the data), but there are two problems: the mean (the magenta dashed line) is off target (the blue solid line in the middle), and there is an outlier (the data point marked with a red triangle).  In addition, in the latter half of the run the points sit on one side of the target, while in the earlier half most of the measurements were on the other side.


We could use a number of standard outlier detection techniques here to identify such data points (you can find some of them here or here), and we could use the standard one-sample t-test to check whether the mean of the data is below the target.  For more details on classic SPC techniques, see the Box & Luceno book.
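
As an illustration, here is a minimal sketch (my own, not from the original post) of what those two checks might look like in Python: a simple 3-sigma rule stands in for the linked outlier detection techniques, and scipy's one-sample t-test checks the mean against the target.  The data and the target value are entirely made up.

```python
import numpy as np
from scipy import stats

# Hypothetical process metric sampled at constant intervals (made-up numbers).
np.random.seed(42)
measurements = np.random.normal(loc=9.6, scale=0.5, size=50)
measurements[30] = 12.0          # inject a spike like the red triangle in Figure 1
target = 10.0

# Outlier check: flag points farther than 3 standard deviations from the mean.
mean, std = measurements.mean(), measurements.std(ddof=1)
outliers = np.where(np.abs(measurements - mean) > 3 * std)[0]
print("Suspected outliers at indices:", outliers)

# One-sample t-test: is the process mean significantly below the target?
t_stat, p_two_sided = stats.ttest_1samp(measurements, popmean=target)
p_below = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
print(f"t = {t_stat:.2f}, one-sided p (mean < target) = {p_below:.4f}")
```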

Making things more interesting

If we have a number of similar processes (e.g., widgets coming from a number of machines, CPU utilization across the servers in a data center, or SAT test results from one school district) characterized by time series similar to that in Figure 1, we should be able to bundle them up, introducing not only a distribution in time, but also a distribution across the bundle.


In this case, the variance that goes into the t-test (as the standard deviation, i.e., the square root of the variance) and into the ANOVA (as the variance) becomes the pooled variance across the m processes in the bundle:

s_pooled^2 = [ (n_1 - 1) s_1^2 + ... + (n_m - 1) s_m^2 ] / [ (n_1 - 1) + ... + (n_m - 1) ]    (1)

The number of data points becomes:

N = n_1 + n_2 + ... + n_m    (2)

And the data points that we saw in Figure 1 as outliers may very well turn out to be regular data points.
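
A minimal sketch of that bundling step, under my assumption (the original equation images are not available) that (1) is the usual degrees-of-freedom-weighted pooled variance and that the sample sizes in (2) simply add up:

```python
import numpy as np

# Hypothetical bundle: m similar processes (machines, servers, schools),
# each contributing its own time series of the same metric.
rng = np.random.default_rng(1)
bundle = [rng.normal(loc=10.0, scale=0.5, size=n) for n in (50, 60, 45)]

# Pooled variance: degrees-of-freedom-weighted average of per-series variances (1).
dof = np.array([len(series) - 1 for series in bundle])
variances = np.array([np.var(series, ddof=1) for series in bundle])
pooled_var = np.sum(dof * variances) / np.sum(dof)

# Total number of data points (2).
n_total = sum(len(series) for series in bundle)

print(f"pooled variance = {pooled_var:.4f}, pooled std = {pooled_var ** 0.5:.4f}")
print(f"total data points N = {n_total}")
```

With the larger N and the bundle-wide variance, a point that looked extreme within a single series may fall well inside the control limits of the bundle.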

How do we measure the quality?

For measurement of process quality, two standard parameters have evolved and become the lingua franca of SPC: the Z score (the sigma level) and the process capability index Cpk:

Z = min(USL − μ, μ − LSL) / σ    (3)

Cpk = min(USL − μ, μ − LSL) / (3σ)    (4)

In other words, if the data follow a normal distribution, we want Z to be better than a certain “critical” number (the higher the better), and we want Cpk to be better than a certain “critical” number (the higher the better).
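
A short sketch (my own illustration) of how Z and Cpk in (3) and (4) might be estimated from a sample, with made-up specification limits; the distinction between short-term and long-term sigma is glossed over here.

```python
import numpy as np

def z_score_and_cpk(data, lsl, usl):
    """Estimate the sigma level Z and the capability index Cpk,
    per equations (3) and (4), from the sample mean and standard deviation."""
    mu = np.mean(data)
    sigma = np.std(data, ddof=1)
    nearest_gap = min(usl - mu, mu - lsl)   # distance to the closer spec limit
    return nearest_gap / sigma, nearest_gap / (3 * sigma)

# Hypothetical process data and specification limits.
rng = np.random.default_rng(7)
data = rng.normal(loc=10.2, scale=0.3, size=200)
z, cpk = z_score_and_cpk(data, lsl=9.0, usl=11.0)
print(f"Z = {z:.2f}, Cpk = {cpk:.2f}")   # compare against the Z = 6 and Cpk = 1.33 goals
```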


As a commonly accepted rule of thumb, it is customary to use Z = 6 and Cpk = 1.33 as the lofty goals.


The Z Score:

The reasoning behind the critical number for the Z score is as follows: if we have Z = 6.0, it means that we have zero defects, an impossible scenario.  For most industries, this requirement can be relaxed: Z = 4.5 translates into a defect rate of 3.4 per million, or 3.4 Defects Per Million Opportunities (DPMO).


Parenthetically, the Holy Grail of the Six Sigma methodology, the Z = 6.0 goal, is for short-term slices (samples) of the process; the samples are assumed to be distributed so that their range accounts for 1.5 Z-scores (for details, see the F-test assumptions).  That means that a short-term Z = 6.0 is really a long-term Z = 4.5.
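
The arithmetic behind those numbers can be checked with the normal tail probability; the sketch below (my own illustration) converts a Z score into DPMO, with and without the conventional 1.5-sigma long-term allowance.

```python
from scipy.stats import norm

def dpmo(z, shift=0.0):
    """Defects Per Million Opportunities for the one-sided normal tail
    beyond z, optionally reduced by the 1.5-sigma long-term allowance."""
    return norm.sf(z - shift) * 1_000_000

print(f"Z = 4.5, no shift       : {dpmo(4.5):.1f} DPMO")              # ~3.4
print(f"Z = 6.0, 1.5-sigma shift: {dpmo(6.0, shift=1.5):.1f} DPMO")   # ~3.4 again
print(f"Z = 6.0, no shift       : {dpmo(6.0):.4f} DPMO")              # ~0.001
```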


A paradox:

If we look at civil aviation, 3.4 DPMO translates into one aircraft failure in midflight roughly every 3.4 days.  Indeed, according to http://1lawflying.wordpress.com/2008/11/04/how-many-flights-per-day-do-air-traffic-controllers-handle-in-the-united-states/, as of November 2008 there were about 87,000 flights per day in the skies over the United States.  That means that a million flights (opportunities to fail) accumulate in 1,000,000 / 87,000 = 11.5 days, and allowing 3.4 of those to fail means that on average a flight is allowed to fail every 11.5 / 3.4 = 3.38 days.  Nobody would want to be on that flight, so the critical Z score for civil aviation should be higher than 4.5.
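
Spelling out that back-of-the-envelope arithmetic (flight counts as quoted in the linked post):

```python
flights_per_day = 87_000      # US flights per day, per the linked 2008 figure
dpmo_allowed = 3.4            # allowed defects per million opportunities at long-term Z = 4.5

days_per_million_flights = 1_000_000 / flights_per_day          # ~11.5 days
days_between_failures = days_per_million_flights / dpmo_allowed
print(f"one in-flight failure every {days_between_failures:.2f} days")   # ~3.4 days
```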


The solution to this paradox is really quite simple if we consider the distribution of failures.  Failure arrivals are always a skewed process with a heavy right tail, typically Poisson (although some models use Pareto, Weibull, and other heavy-tailed distributions); in other words, failure interarrival times (Times Between Failures, a measure used in reliability analysis) can typically be assumed to be exponentially distributed.  Aircraft injection into the airspace (takeoff or flying into its scope) and removal from it (landing or flying out of its scope) are also heavy-tailed processes.


A combination of these distributions will not produce a normal distribution under any circumstances.  Therefore the Z score, devised as a measure of quality for a normally distributed process, is not a good measure of the quality (and safety) of operation of the National Airspace.  This is another example of not letting the CLT mislead us.
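
A quick simulation (my own illustration, not part of the original argument) makes the point: exponential failure interarrival times combined with heavy-tailed residence times still fail a normality test decisively.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)

# Failure interarrival times: exponential (the classic reliability assumption).
time_between_failures = rng.exponential(scale=3.4, size=5000)

# Aircraft residence times in the airspace: modeled here as Pareto (heavy-tailed).
residence_times = (rng.pareto(a=3.0, size=5000) + 1.0) * 1.5

# A simple combination of the two effects: add them per observation.
combined = time_between_failures + residence_times

# D'Agostino-Pearson normality test: a tiny p-value means "not normal".
stat, p_value = stats.normaltest(combined)
print(f"normality test p-value = {p_value:.3g}")   # effectively zero: not normal
```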

The Process Capability:

Unlike the Z score, the Process Capability index measures how far the process is from the specification limits.  It does not take into account how close the process metric is to the target, but it does take into account the fact that the specification limits can be asymmetrical.  The lower limit of 1.33 is somewhat arbitrary: we want to leave a padding of 1/3 of the 3-standard-deviation range before we hit the upper or lower specification limit, whichever is closer (USL or LSL, respectively).  Hence


Cpk_min = 4 / 3 ≈ 1.33    (5)


We have covered the anatomy of an SPC chart. Some of the concepts appear arbitrary (and are arbitrary), but are adjustable based on "what makes sense for the business". We'll talk more about business metrics in later posts in this series.

This concludes the section on Anatomy of an SPC Chart.  Stay tuned for Trended Time Series and SPC.
