Wednesday, July 24, 2013

From Outliers to Process Anomalies: Predictive SPC. Part 4

Predictive SPC. Part 4
Statistical Process Control (SPC) is a well-described framework used to identify weak points in any process and predict the probability of failure in it. The distribution parameters of process metrics have been translated into process capability, which evolved in the 1990s into the Six Sigma methodology in a number of incarnations. However, all techniques derived for SPC have two important weaknesses: they assume that the process metric is in a steady state, and they assume that the process metric is normally distributed or can be converted to a normal distribution. The concepts and ideas outlined here make it possible to overcome these two shortcomings. The method has been developed and validated in collaboration with Josep Ferrandiz.
This is the fourth post in the series on Predictive SPC.

Definitions are here.

In Part 1, we covered the Axioms and Assumptions (A&As) of SPC and explained why they are wrong, but useful. In Part 2, we talked about the key concepts and ideas of SPC, and Part 3 had a discussion of how the key SPC concepts and ideas change if we consider non-stationary processes. Now we are beginning to bring it all together.

Bringing it All Together

We have established

  1. That the body of work related to Statistical Process Control (SPC) can very well be used for stationary (steady-state) data.
  2. That these methodologies can be used to identify outliers, loss of stationarity, and other deviations from what is considered normal data behavior ("normal" in this context does not imply normality of distribution).
  3. That these methods cannot, and should not, be used to identify anomalies in nonstationary data.

Predictive SPC to the Rescue

By “Predictive SPC”, we refer to a wide class of methods that we have developed to gauge variability in nonstationary processes.  Our contribution is in combining the philosophy and methodology of classic SPC with more recent advanced statistical techniques, introduced by researchers over the last 25 years and recently implemented in a number of statistical software systems: general-use applications, from R to SAS and SPSS, as well as more application-specific packages such as Autobox, ForecastPro, Netuitive, and others.

The philosophy behind Predictive SPC

The idea is simple: we know that the data are not stationary, but we want to understand whether the non-stationarity is an artifact of a relationship with another metric whose behavior is known; if it is, then we want to track changes in this relationship.

In this section, we outline the methodology we have used successfully for identifying changes in the behavior of a Key Performance Metric (KPM) when it is driven by a single business metric (BMI = Business Metric of Interest).  This methodology can easily be extended to multivariate models.

It is important to mention that this methodology works for cases where a consistently predictive model connecting the KPM with the BMI (e.g., the number of queries in the system with the business transaction rate) is known to exist, and we know which is the dependent and which the independent variable.  The goal is to determine whether the variation in the KPM (Key Performance Metric, the dependent variable) is related to a change in the BMI (Business Metric of Interest, the independent, or explanatory, variable), and whether this variation is predicted by this relationship.

Problem Statement:

Given:

KPM = traffic in the system
BMI = Transaction Rate (can be any explanatory variable)
Both KPM and BMI have weekly and daily seasonality.


For illustration purposes, I am using a real enterprise-level use case (with data generated to provide patterns similar to what one normally sees in IT production, to protect sensitive information without losing data relationships), where for ~75% of all entities a polynomial (quadratic) relationship was known a priori to exist between the BMI and the KPM, as shown in Figure 3.  The other entities were known to be independent of the business metric.

Figure 3: Relationship between Performance Metric and Explanatory Variable.  
The red, green, and blue lines correspond to the 95th, 50th, and 5th percentile of data, respectively (more details below).

Care must be taken in applying polynomial models, as the best-fitted curve for the upper percentiles may peak in the middle and go down, whereas the best-fitted curve for the lower percentiles may do the opposite.  In implementations of such models, therefore, it is important to have a conditional model substitution for such data.  Linear or exponential models may be better predictors in such cases, even if their correlation coefficient is not as good as the polynomial's.

If the model is unknown, it can be established using the methodology described in [Alexander Gilgur, Michael Perka (2006). A Priori Evaluation of Data and Selection of Forecasting Model. Presented at the 32nd International Computer Measurement Group (CMG'06) Conference, December 2006, Reno, NV] to find the best-fitted Least Squares model, using the Fisher-transformed correlation coefficient as the criterion of "best fitted".
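
As an illustration, here is a minimal sketch of this selection step in Python (assuming numpy); the candidate model set and the variable names are illustrative assumptions, not the exact procedure from the CMG'06 paper:

    import numpy as np

    def fisher_z(r):
        """Fisher transformation of a correlation coefficient."""
        return np.arctanh(r)

    def best_fitted_model(bmi, kpm):
        """Fit linear, quadratic, and (when KPM > 0) exponential least-squares
        models; pick the one whose predictions correlate best with the data,
        comparing Fisher-transformed correlation coefficients."""
        preds = {
            "linear": np.polyval(np.polyfit(bmi, kpm, 1), bmi),
            "quadratic": np.polyval(np.polyfit(bmi, kpm, 2), bmi),
        }
        if np.all(kpm > 0):  # exponential model: a line fitted to log(KPM)
            c = np.polyfit(bmi, np.log(kpm), 1)
            preds["exponential"] = np.exp(np.polyval(c, bmi))
        scores = {name: fisher_z(np.corrcoef(kpm, p)[0, 1])
                  for name, p in preds.items()}
        return max(scores, key=scores.get), scores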

Other, application-specific methods can be used to establish the “right” model (yes, we are aware of George E. P. Box’s famous statement that “all models are wrong; some models are useful”). Such methods are outside the scope of this post.

Task:

In order to help developers identify issues in software scalability, establish a process to detect changes in behavior of Concurrency.

Solution Plan:

  1. Build regressions based on the data in a baseline time range and in the latest time range.  Since the data vary weekly, there should be at least two weeks’ worth of data in each data set, in order to be able to correct for weekly variation.
  2. For a given set of BMIs (explanatory variables), compute the model predictions using models built for each of the data sets.
  3. Compare the model predictions at the given BMI values of interest (a minimal sketch of these three steps follows below).
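
Here is a minimal sketch of the plan in Python, assuming pandas and statsmodels; the column names (kpm, bmi) and the quadratic median-regression formula are illustrative assumptions, not the exact production model:

    import pandas as pd
    import statsmodels.formula.api as smf

    def fit_median_model(df):
        """Step 1: median (50th percentile) quantile regression of KPM on BMI
        for one time window."""
        return smf.quantreg("kpm ~ bmi + I(bmi ** 2)", df).fit(q=0.5)

    def compare_windows(baseline, latest, bmi_values):
        """Steps 2 and 3: predict from both models at the BMI values of
        interest and return the difference in predicted medians."""
        grid = pd.DataFrame({"bmi": bmi_values})
        m1 = fit_median_model(baseline)  # baseline window (>= 2 weeks of data)
        m2 = fit_median_model(latest)    # latest window (>= 2 weeks of data)
        return m2.predict(grid) - m1.predict(grid)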

Caveats:

  1. Nonlinear nature of the relationship between the dependent and independent variables.
  2. Data are seasonal (daily and weekly), and the relationship between the BMI and the KPM may behave differently at different times of the day and on different days of the week; in other words, the regression model has to be 3-dimensional:

KPM = f (BMI(t, d), t, d) (6)

where t = time of the day, and d = day of the week.

  3. Data cannot be assumed to be normally distributed (e.g., if the BMI is transaction rate, it is a Poisson-like variable, and if the KPM is the total number of queries in the system, then the KPM is an Erlang process: according to Little’s Law, the total number of queries in the system (concurrency) is the product of the Poisson-distributed request arrival rate and the exponentially distributed request processing time); a toy numeric check of this point follows below.
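
As a toy numeric check of the Little’s Law argument (assumes numpy; the parameters are illustrative): requests arrive as a Poisson process and hold an exponentially distributed processing time, and the mean concurrency approaches arrival rate times mean processing time:

    import numpy as np

    rng = np.random.default_rng(0)
    lam, mean_service, horizon = 100.0, 0.2, 100.0  # arrivals/s, s, s

    n = rng.poisson(lam * horizon)                  # total number of arrivals
    arrivals = np.sort(rng.uniform(0.0, horizon, n))
    departures = np.sort(arrivals + rng.exponential(mean_service, n))

    # Concurrency at time t = (arrivals so far) - (departures so far).
    t = rng.uniform(1.0, horizon, 5000)
    in_system = np.searchsorted(arrivals, t) - np.searchsorted(departures, t)
    print(in_system.mean())  # ~ lam * mean_service = 20 concurrent queries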

A more in-depth process complexity analysis is given in, e.g., our paper presented at CMG’12 [Alex Gilgur, Josep Ferrandiz, Matthew Beason (2012). Time-Series Analysis: Forecasting + Regression: And or Or? Presented at the 38th International Computer Measurement Group (CMG’12) Conference, December 2012, Las Vegas, NV].

Solution to the Caveats:

  1. Use quantile regression: because of the skewed (non-Gaussian) distributions of the two metrics in our case, we know that the upper (right) tail of the KPM will behave differently from the lower (left) tail of the distribution.  Quantile regression captures this difference in behavior.
  2. Derive a formula that will be a non-parametric version of the same formulae that are used in “classic” SPC.

Quantile regression offers a self-consistent, outlier-independent way of tracing the behavior of, e.g., the 95th percentile, the 50th percentile (median), and the 5th percentile of data.

Other percentiles can be used as well.  Thus, if we use quantile regression for the 25th and the 75th percentiles of the data, we can easily construct a system for detecting outliers in the dependent variable (KPM = Concurrency). With outliers identified, we can then derive a stationary outlier tracking variable (number and magnitude) that allows us to easily compare the baseline and the new data sets.  While we have successfully implemented this method as part of the implementation of Predictive SPC, further in-depth discussion is outside the scope of this paper.
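
For concreteness, here is a minimal sketch of tracing the percentile curves (like the three lines in Figure 3) with quantile regression, assuming statsmodels and pandas; the column names and the quadratic formula are illustrative assumptions:

    import pandas as pd
    import statsmodels.formula.api as smf

    def percentile_curves(df, quantiles=(0.05, 0.50, 0.95)):
        """Fit one quantile regression per percentile of interest; the same
        model object is refitted at each quantile."""
        model = smf.quantreg("kpm ~ bmi + I(bmi ** 2)", df)
        return {q: model.fit(q=q) for q in quantiles}

    def predict_curves(fits, bmi_values):
        """Predicted percentile curves at the BMI values of interest."""
        grid = pd.DataFrame({"bmi": bmi_values})
        return {q: fit.predict(grid) for q, fit in fits.items()}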

The Formula

The formula that we derived is based on the same principles as the Z-score (3): the Z-score tells us how different, in standard deviations, the mean is from the target.  We want to know how different a central measure of the KPM (the median) is now from what it was during the baseline time frame.  This has to be relative to the (5%-95%) data range.

In other words, if, e.g., at a given value of BMI, and on a given day at a given hour, the new KPM median has changed by more than 3 times its range, the process has changed.  If it has changed by less than 1% of the KPM range, then we can say that the change is not significant.

However, the range may not necessarily be constant.  To account for that, we propose using the average of the two ranges in the formula for the measure that we use as process consistency.


We propose a W-score for the measurement of consistency of nonstationary processes, defined as follows:

For a given set of (BMI, T, D),

W = 2 (M2 - M1) / (R1 + R2)    (7)
where
W = the process consistency score.
M1, M2 = KPM medians in the baseline (1) and in the latest (2) data sets.
R1, R2 = KPM ranges in the baseline (1) and in the latest (2) data sets.
T = Time of the day.
D = Day of the week.

The ranges are defined as the difference between the 95th and the 5th percentiles of the KPM (the dependent variable that we are monitoring).

The values of BMI at which the W is measured can be preset as a process parameter. The day of the week and time of the day at which the W is measured can be preset as process parameters as well.
It must also be noted that where the range of the dependent variable is the measure of the process quality, we should use the counterpart of Eq. (7) with the roles of the medians and the ranges swapped:

W = 2 (R2 - R1) / (M2 + M1)    (8)

Because we are not using parametric methods, we can directly subtract the ranges.

Note that R1, R2, M1, and M2 are all measured as values predicted by the regression at given values of the explanatory variables (in our case, BMI, time of the day, and day of the week).
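
A minimal sketch of Eqs. (7) and (8) in Python, taking the percentile predictions of the baseline (1) and latest (2) models at a preset (BMI, T, D) point; the function and argument names are illustrative:

    def w_score(p05_1, p50_1, p95_1, p05_2, p50_2, p95_2):
        """Eq. (7): shift of the median, relative to the average of the two
        (5th-95th percentile) ranges."""
        r1 = p95_1 - p05_1  # baseline KPM range
        r2 = p95_2 - p05_2  # latest KPM range
        return 2.0 * (p50_2 - p50_1) / (r1 + r2)

    def w_score_range(p05_1, p50_1, p95_1, p05_2, p50_2, p95_2):
        """Eq. (8): change of the range, relative to the average of the two
        medians, for processes where the range is the quality measure."""
        r1 = p95_1 - p05_1
        r2 = p95_2 - p05_2
        return 2.0 * (r2 - r1) / (p50_1 + p50_2)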

Is there another way?

Yes, there is.  The formulae (7) and (8) that we are proposing are very efficient in bringing the SPC problem for non-stationary data into the realm where it becomes similar to problems already being solved by the classic SPC.  If, however, we are interested in finding the deeper patterns, we can use other non-parametric methods.  

In particular, the Kolmogorov-Smirnov (KS) test, as well as the Chi-Square Goodness of Fit (GoF) test, comes in handy if we want to see how much overlap there is between the baseline and the new data distributions, using the p-value to determine to what degree the two distributions still overlap.

The non-predictive nature of these tests can easily be overcome by combining the Predictive SPC methodology with the KS or Chi-Square tests: if we slice the data by a sufficient number of quantile-regression lines and apply the GoF tests at the BMI values of interest, we will significantly improve the power of the method and will be able to answer the questions of confidence more reliably. Specifics of the Kolmogorov-Smirnov test do not allow it to be used with discrete data, as the p-value is only asymptotic in the presence of ties.  There are ways to overcome this circumstance, but they are outside the scope of this paper.  The Chi-Square test, on the other hand, does not have this limitation.
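
As an example, here is a minimal sketch of comparing the baseline and latest KPM distributions at a BMI value of interest with the two-sample KS test (assumes scipy and pandas DataFrames; the slice width is an illustrative assumption):

    from scipy.stats import ks_2samp

    def compare_at_bmi(baseline, latest, bmi0, width=0.05):
        """Take a narrow slice of each window around bmi0 and test whether
        the two KPM distributions still overlap."""
        lo, hi = bmi0 * (1 - width), bmi0 * (1 + width)
        a = baseline.loc[baseline["bmi"].between(lo, hi), "kpm"].to_numpy()
        b = latest.loc[latest["bmi"].between(lo, hi), "kpm"].to_numpy()
        stat, pvalue = ks_2samp(a, b)  # small p-value => distributions diverged
        return stat, pvalue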

Outliers

Outliers (see, e.g., from here to here), from the SPC perspective, are defects: points on the timeline where the process metric did not behave “as predicted”.  With stationary data, the standard Tukey method works very robustly:

Find the first (Q1) and third (Q3) quartiles (the 25th and 75th percentiles); compute the interquartile range as IQR = Q3 - Q1; points outside the range from Q1 - 1.5 * IQR to Q3 + 1.5 * IQR are then outliers.

With quantile regression, the same method can be extended to non-stationary data, as long as there is a variable correlated with the process metric whose behavior is known (e.g., time of the day, day of the week, day of the year, or number of business transactions processed, or any other variable):
Find the quantile regression predictions for the first (Q1) and third (Q3) quartiles; compute the interquartile range as IQR = Q3 - Q1; points outside the range from Q1 - 1.5 * IQR to Q3 + 1.5 * IQR are then outliers (a minimal sketch follows below).
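
A minimal sketch of this quantile-regression version of the Tukey rule, assuming statsmodels and pandas; the column names and the quadratic formula are again illustrative:

    import statsmodels.formula.api as smf

    def quantile_regression_outliers(df):
        """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], where Q1 and Q3
        are the 25th/75th percentile predictions at each point's BMI."""
        model = smf.quantreg("kpm ~ bmi + I(bmi ** 2)", df)
        q1 = model.fit(q=0.25).predict(df)
        q3 = model.fit(q=0.75).predict(df)
        iqr = q3 - q1
        return (df["kpm"] < q1 - 1.5 * iqr) | (df["kpm"] > q3 + 1.5 * iqr)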

Proceed with Caution

One important caveat: since we are looking at a multivariate quantile regression, things get more complicated: a point that is an outlier with respect to the BMI may not be an outlier with respect to the day of the week or the hour of the day, and vice versa.


We have brought it all together. Stay tuned for Post 5 - Method Illustration!
