Sunday, July 28, 2013

From Outliers to Process Anomalies: Predictive SPC. Part 5

Predictive SPC. Part 5
Statistical Process Control (SPC) is a well-described framework used to identify weak points in a process and predict its probability of failure.  The distribution parameters of process metrics have been translated into process capability, which evolved in the 1990s into the Six Sigma methodology in a number of incarnations. However, all techniques derived for SPC share two important weaknesses: they assume that the process metric is in a steady state, and they assume that the process metric is normally distributed or can be converted to a normal distribution.  The concepts and ideas outlined here make it possible to overcome these two shortcomings. This method has been developed and validated in collaboration with Josep Ferrandiz.  

This is the concluding post in the series on Predictive SPC.


Definitions are here.


In Part 1, we covered the Axioms and Assumptions (A&As) of SPC and explained why they are wrong but useful.  In Part 2, we talked about the key concepts and ideas of SPC, and Part 3 discussed how those concepts and ideas change when we consider non-stationary processes. Finally, in Part 4, we brought it all together, and now we are moving on to an illustration of the methodology.


Method Illustration

The examples used for illustration are taken from the world of client-server applications in IT.  One of the business objectives in IT operations is to always stay ahead of the business curve, so that when new business opportunities propel the company to new levels, there is enough hardware to sustain them.  One way to do that is to reduce the amount of resources needed to sustain a given business rate.  In IT, the KPM can be traffic (Concurrency = number of concurrent queries in the system at any time), CPU load, memory utilization, data flow rate, latency, etc.  The BMI can be transaction rate, revenue, all kinds of confidence indices, etc.


Figure 4 shows the values of the chosen KPM as predicted by the same model built on the data observed in the Baseline period and on the data observed in the new data set.  It shows that traffic became higher in the new time frame than it was during the Baseline time segment: the nature of the data changed.  The second-order (quadratic) component of the relationship between the KPM and the business metric is now statistically significant.


Figure 4: Illustration of Process Degradation:
predictions from the model built on the baseline data are to the left of the dividing vertical line; predictions based on the recent data are to the right.



In this particular application, the same model, applied to the baseline and the new data, predicted vastly different KPM values: the new data gave much higher numbers than the baseline.  We find that W varies from +0.5 at low values of the business metric to +5.3 at high BMI values.  Conclusion: this application has recently degraded.


Positive values of the W score indicate process degradation.


On the other hand, Figure 5 shows another case, where W went from -0.2 to -7.4.


Figure 5: Illustration of Process Improvement:
predictions from the model built on the baseline data are to the left of the dividing vertical line; predictions based on the recent data are to the right.

In Figure 5, the same model applied to the new data shows lower values of the KPM than when it was applied to the baseline data.  The negative value of W indicates process improvement; in this case we find W = -0.2 at lower BMI values and W = -7.4 in the high-BMI zone.

Usage

Figures 4 and 5 illustrate the use of the method when the system degraded (Figure 4) and when the system improved (Figure 5).  Figure 6 presents the algorithm that we implemented for Predictive SPC.


For reasons of information sensitivity, all the data in this post have been generated to replicate patterns similar to those one usually sees in a real IT production environment.



Figure 6: Predictive SPC core algorithm


The strength of the core algorithm in Figure 6 lies in its applicability to multiple entities and in the fact that it is metric-agnostic: it can provide a single-number measure of enterprise-wide performance quality, regardless of the metrics used for a particular facet of performance.
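To make this concrete, here is a minimal sketch (in Python, with hypothetical entity and metric names and made-up numbers) of how the W score defined in Part 4 of this series can be rolled up across entities and metrics. Because W is dimensionless, concurrency, latency, and CPU utilization can all be ranked on the same scale.

```python
# Minimal sketch: one dimensionless W score per (entity, metric) pair.
# Entity and metric names are hypothetical, and the M/R numbers stand in
# for the median and the 5th-95th percentile range predicted by the
# regression models built on the baseline (1) and recent (2) data sets.

def w_score(m1, m2, r1, r2):
    """W = 2 * (M2 - M1) / (R1 + R2); positive values indicate degradation."""
    return 2.0 * (m2 - m1) / (r1 + r2)

predictions = {
    ("checkout-service", "concurrency"): dict(m1=120.0, m2=180.0, r1=40.0, r2=48.0),
    ("checkout-service", "latency_ms"):  dict(m1=35.0,  m2=33.0,  r1=12.0, r2=10.0),
    ("search-service",   "cpu_util"):    dict(m1=0.55,  m2=0.42,  r1=0.20, r2=0.18),
}

scores = {key: w_score(**vals) for key, vals in predictions.items()}

# Rank from worst (most degraded) to best (most improved), regardless of metric.
for (entity, metric), w in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{entity:20s} {metric:12s} W = {w:+.2f}")
```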


The output can be visualized using Heat Maps, Dashboards, and other techniques.  Standard off-the-shelf data visualization tools, such as Tableau, SAS, and R visualization packages (ggplot2 and/or lattice), as well as specialized tools, can be used to produce high-quality graphics.  

Enterprise-level applicability

The main requirements for enterprise-level applicability of the method we are describing, namely scalability and ease of interpretation, are met by using parallel processing and visualization.  Heat Maps offer a great way to visualize many metrics for multiple entities at once, but that is outside the scope of this post.  The W score, as a measure of process consistency, offers a way to identify the worst and the best applications (high and low outliers, respectively), “Top 10” units, etc.

Conclusions

This is the final post in the series describing a novel, robust, reliable, scalable, self-consistent, and easy-to-interpret methodology for measuring process quality in a non-stationary system.  


This methodology can be extended to stationary processes and therefore is inclusive of the current state of the art.  Needless to say, the KPM, BMI, and other terms were used throughout the paper for the sole purpose of illustration.  


A way to establish a quantile regression model in cases where the model structure is unknown has been outlined in general terms here.  When implementing polynomial models, it is important to have a conditional model substitution (see here).  Linear or exponential models may be better predictors in such cases, even if their correlation coefficient is not as good as the polynomial's.


This methodology can be used with any system where a predictive model can be applied to the baseline and recent data.  There is no need to normalize the data by applying the Box-Cox and other transformations: the methods we are using are entirely non-parametric.  


This Predictive SPC methodology has been successfully implemented in an enterprise-level tool for IT in an environment where stationarity is not to be expected.  The tool implementing this methodology can be used to quickly (within two weeks) highlight problems with new production releases and to help engineers identify and fix problems whose identification alone would otherwise have taken several months.


And that concludes the series on Predictive SPC.

References

[1] Ricci, V. (2005). Fitting Distributions with R. http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf  Downloaded 06/02/2010.
[2] Box, G.E.P., Luceno, A. (Eds.) (1997). Statistical Control by Monitoring and Feedback Adjustment. Wiley Series in Probability and Statistics. John Wiley and Sons, Inc. ISBN 0-471-19046-2.
[3] Ferrandiz, J., Gilgur, A. (2012). Level of Service Based Capacity Planning. Presented at the 38th International Computer Measurement Group (CMG'12) Conference, December 2012, Las Vegas, NV.
[4] Chatfield, C. (2003). The Analysis of Time Series: An Introduction (6th ed.). Chapman and Hall/CRC.
[5] Shumway, R.H., Stoffer, D.S. (2006). Time Series Analysis and Its Applications: With R Examples (2nd ed.). Springer Texts in Statistics. ISBN 0-387-29317-5.
[6] Ferrandiz, J., Gilgur, A. (2012). A Note on Knee Detection. Presented at the 38th International Computer Measurement Group (CMG'12) Conference, December 2012, Las Vegas, NV.
[7] Gilgur, A., Perka, M. (2006). A Priori Evaluation of Data and Selection of Forecasting Model. Presented at the 32nd International Computer Measurement Group (CMG'06) Conference, December 2006, Reno, NV.
[8] Gilgur, A., Ferrandiz, J., Beason, M. (2012). Time-Series Analysis: Forecasting + Regression: And or Or? Presented at the 38th International Computer Measurement Group (CMG'12) Conference, December 2012, Las Vegas, NV.
[9] Gilgur, A. (2013). Little's Law assumptions: "But I still wanna use it!" The Goldilocks solution to sizing the system for non-steady-state dynamics. CMG MeasureIT Journal, Iss. 13.6, June 2013. http://www.cmg.org/measureit/issues/mit100/m_100_6.pdf



Wednesday, July 24, 2013

From Outliers to Process Anomalies: Predictive SPC. Part 4

Predictive SPC. Part 4
Statistical Process Control (SPC) is a well-described framework used to identify weak points in a process and predict its probability of failure.  The distribution parameters of process metrics have been translated into process capability, which evolved in the 1990s into the Six Sigma methodology in a number of incarnations. However, all techniques derived for SPC share two important weaknesses: they assume that the process metric is in a steady state, and they assume that the process metric is normally distributed or can be converted to a normal distribution.  The concepts and ideas outlined here make it possible to overcome these two shortcomings. The method has been developed and validated in collaboration with Josep Ferrandiz.  
This is the fourth post in the series on Predictive SPC.

Definitions are here.

In Part 1, we covered the Axioms and Assumptions (A&As) of SPC and explained why they are wrong but useful.  In Part 2, we talked about the key concepts and ideas of SPC, and Part 3 discussed how those concepts and ideas change when we consider non-stationary processes.  Now we are beginning to bring it all together.

Bringing it All Together

We have established

  1. That the body of work related to Statistical Process Control (SPC) can very well be used for stationary (steady-state) data.
  2. That these methodologies can be used to identify outliers, loss of stationarity, and other deviations from what is considered normal data behavior ("normal" in this context does not imply normality of distribution).
  3. That these methods cannot, and should not, be used to identify anomalies in nonstationary data.

Predictive SPC to the Rescue

By “Predictive SPC”, we refer to a wide class of methods that we have developed to gauge variability in nonstationary processes.  Our contribution to the development of these methods is in combining the philosophy and methodology of classic SPC with more recent advanced statistical techniques, introduced by researchers over the last 25 years and recently implemented in a number of statistical software systems: both general-purpose applications, from R to SAS and SPSS, and more application-specific packages such as Autobox, ForecastPro, Netuitive, and others.

The philosophy behind Predictive SPC

The idea is simple: we know that the data are not stationary, but we want to understand whether the non-stationarity is an artifact of a relationship with another metric whose behavior is known; if it is, then we want to track changes in that relationship.

In this section, we outline the methodology we have used successfully for identifying changes in the behavior of a Key Performance Metric (KPM) when it is driven by a single business metric (BMI = Business Metric of Interest).  This methodology can easily be extended to multivariate models.

It is important to mention that this methodology works for cases where a consistently predictive model connecting the KPM with the BMI (e.g., the number of queries in the system with the business transaction rate) is known to exist, and we know which is the dependent and which is the independent variable.  The goal is to determine whether the variation in the KPM (Key Performance Metric, the dependent variable) is related to a change in the BMI (Business Metric of Interest, the independent, or explanatory, variable), and whether this variation is predicted by that relationship.

Problem Statement:

Given:

KPM = traffic in the system
BMI = Transaction Rate (can be any explanatory variable)
Both the KPM and the BMI have weekly and daily seasonality.


For illustration purposes, we are using a real enterprise-level use case (with data generated to provide patterns similar to what one normally sees in IT production, in order to protect sensitive information without losing the data relationships), in which, for ~75% of all entities, a polynomial (quadratic) relationship was known a priori to exist between the BMI and the KPM, as shown in Figure 3.  The other entities were known to be independent of the business metric.

Figure 3: Relationship between Performance Metric and Explanatory Variable.  
The red, green, and blue lines correspond to the 95th, 50th, and 5th percentiles of the data, respectively (more details below).

Care must be taken in applying polynomial models: the best-fitted curve for the upper percentiles may peak in the middle and then go down, whereas the best-fitted curve for the lower percentiles may do the opposite.  When implementing such models, therefore, it is important to have a conditional model substitution for such data.  Linear or exponential models may be better predictors in such cases, even if their correlation coefficient is not as good as the polynomial's.
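One possible way to implement such a conditional substitution is sketched below. It is only an illustration on assumed synthetic data, not the production logic: fit the quadratic quantile regression, check whether the parabola's vertex falls inside the observed BMI range (i.e., the curve peaks or bottoms out within the data), and fall back to a linear fit if it does.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic (BMI, KPM) data for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({"bmi": rng.uniform(10, 100, 500)})
df["kpm"] = 0.002 * df.bmi**2 + 0.5 * df.bmi + rng.gamma(2.0, 2.0, len(df))

def fit_upper_percentile(df, tau=0.95):
    """Quadratic quantile fit with a linear fallback if the fitted parabola
    turns over (vertex) inside the observed BMI range."""
    quad = smf.quantreg("kpm ~ bmi + I(bmi ** 2)", df).fit(q=tau)
    b, c = quad.params["bmi"], quad.params["I(bmi ** 2)"]
    if c != 0:
        vertex = -b / (2 * c)
        if df.bmi.min() < vertex < df.bmi.max():
            # Non-monotonic inside the data range: substitute a linear model.
            return smf.quantreg("kpm ~ bmi", df).fit(q=tau)
    return quad

model = fit_upper_percentile(df)
print(model.params)
```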

If the model is unknown, it can be established using the methodology described in [Alexander Gilgur, Michael Perka (2006).  A Priori Evaluation of Data and Selection of Forecasting Model – presented at the 32nd International Computer Measurement Group (CMG'06) Conference, December 2006, Reno, NV]: find the best-fitted Least Squares model, using the Fisher-transformed correlation coefficient as the criterion of "best fitted".
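The snippet below is a schematic reconstruction of that idea under our own assumptions (synthetic data and an arbitrary set of candidate models): fit several candidate least-squares models, correlate the fitted values with the observations, and pick the model with the highest Fisher-transformed correlation coefficient, z = arctanh(r).

```python
import numpy as np

# Synthetic KPM-vs-BMI data; kept strictly positive so the exponential fit is valid.
rng = np.random.default_rng(1)
x = np.linspace(1, 100, 300)
y = 0.01 * x**2 + x + 10 + rng.normal(0, 2, x.size)

candidates = {
    "linear":      lambda x, y: np.polyval(np.polyfit(x, y, 1), x),
    "quadratic":   lambda x, y: np.polyval(np.polyfit(x, y, 2), x),
    # Exponential model fitted by least squares on log(y); assumes y > 0.
    "exponential": lambda x, y: np.exp(np.polyval(np.polyfit(x, np.log(y), 1), x)),
}

def fisher_z(r):
    """Fisher transformation of the correlation coefficient."""
    return np.arctanh(r)

scores = {}
for name, fit in candidates.items():
    fitted = fit(x, y)
    r = np.corrcoef(fitted, y)[0, 1]
    scores[name] = fisher_z(r)

best = max(scores, key=scores.get)
print(scores, "->", best)
```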

Other, application-specific methods can be used to establish the "right" model (yes, we are aware of George E. P. Box's famous statement that "all models are wrong; some models are useful").  Such methods are outside the scope of this post.

Task:

In order to help developers identify issues in software scalability, establish a process to detect changes in behavior of Concurrency.

Solution Plan:

  1. Build regressions based on the data in a baseline time range and in the latest time range.  Since the data vary weekly, there should be at least two weeks' worth of data in each data set, in order to be able to correct for weekly variation.
  2. For a given set of BMI values (explanatory variables), compute the model predictions using the models built on each of the data sets.
  3. Compare the model predictions at the given BMI values of interest (a minimal sketch follows this list).
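The sketch below walks through these three steps with ordinary least-squares quadratic fits on synthetic data; the data generator, column names, and BMI grid are assumptions made for illustration only, and the refinement to quantile regression is covered under "Solution to the Caveats" below.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def make_data(weeks, slope, seed):
    """Synthetic (BMI, KPM) observations covering at least two weeks."""
    rng = np.random.default_rng(seed)
    n = weeks * 7 * 24                      # hourly observations
    bmi = rng.uniform(10, 100, n)
    kpm = slope * bmi + 0.01 * bmi**2 + rng.normal(0, 3, n)
    return pd.DataFrame({"bmi": bmi, "kpm": kpm})

# Step 1: fit the same model form on the baseline and on the latest data set.
baseline = make_data(weeks=2, slope=1.0, seed=0)
latest   = make_data(weeks=2, slope=1.4, seed=1)   # simulated degradation
formula  = "kpm ~ bmi + I(bmi ** 2)"
fit_base   = smf.ols(formula, baseline).fit()
fit_latest = smf.ols(formula, latest).fit()

# Step 2: predict the KPM from both models at the same BMI values of interest.
bmi_grid = pd.DataFrame({"bmi": [25, 50, 75, 95]})
pred_base   = fit_base.predict(bmi_grid)
pred_latest = fit_latest.predict(bmi_grid)

# Step 3: compare the predictions at each BMI value.
print(pd.DataFrame({"bmi": bmi_grid.bmi,
                    "baseline": pred_base,
                    "latest": pred_latest,
                    "delta": pred_latest - pred_base}))
```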

Caveats:

  1. Nonlinear nature of the relationship between the dependent and independent variables.
  2. Data are seasonal (daily and weekly), but also the relationship between the BMI and the KPM may behave differently at different times of the day and on different days of the week; in other words, the regression model has to be 3-dimensional:

KPM = f (BMI(t, d), t, d) (6)

where t = time of the day, and d = day of the week.

  3. Data cannot be assumed to be normally distributed (e.g., if the BMI is a transaction rate, it is a Poisson-like variable, and if the KPM is the total number of queries in the system, then the KPM is an Erlang process: according to Little's Law, the total number of queries (concurrency) is a product of a Poisson distribution of request arrivals and an exponential distribution of request processing times).

A more in-depth process complexity analysis is given in, e.g., our paper presented at CMG'12 [Alex Gilgur, Josep Ferrandiz, Matthew Beason. (2012) Time-Series Analysis: Forecasting + Regression: And or Or? – presented at the 38th International Computer Measurement Group (CMG'12) Conference, December 2012, Las Vegas, NV].

Solution to the Caveats:

  1. Use quantile regression: because of the skewed (non-Gaussian) distributions of the two metrics in our case, we know that the upper (right) tail of the KPM distribution will behave differently from the lower (left) tail.  Quantile regression captures this difference in behavior.
  2. Derive a formula that will be a non-parametric version of the same formulae that are used in “classic” SPC.

Quantile regression offers a self-consistent and outlier-independent way of tracing the behavior of, e.g., the 95th percentile, the 50th percentile (median), and the 5th percentile of the data.  
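As an illustrative sketch (reusing the synthetic-data conventions and model form assumed in the earlier snippet), quantile regressions at the 5th, 50th, and 95th percentiles yield, at any BMI value of interest, the predicted median M and the predicted 5th-to-95th percentile range R that the W-score below is built from.

```python
import pandas as pd
import statsmodels.formula.api as smf

def median_and_range(df, bmi_values, formula="kpm ~ bmi + I(bmi ** 2)"):
    """Fit quantile regressions at the 5th, 50th, and 95th percentiles and
    return the predicted median (M) and 5th-95th percentile range (R)
    of the KPM at the requested BMI values."""
    grid = pd.DataFrame({"bmi": bmi_values})
    preds = {tau: smf.quantreg(formula, df).fit(q=tau).predict(grid)
             for tau in (0.05, 0.50, 0.95)}
    return pd.DataFrame({"bmi": grid.bmi,
                         "M": preds[0.50],
                         "R": preds[0.95] - preds[0.05]})

# Example usage, with the synthetic 'baseline' / 'latest' frames from the earlier sketch:
# base_mr   = median_and_range(baseline, [25, 50, 75, 95])
# latest_mr = median_and_range(latest,   [25, 50, 75, 95])
```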

Other percentiles can be used as well.  Thus, if we use quantile regression for the 25th and the 75th percentiles of the data, we can easily construct a system for detecting outliers in the dependent variable (KPM = Concurrency).  With the outliers identified, we can then derive a stationary outlier-tracking variable (number and magnitude of outliers) that allows us to easily compare the baseline and the new data sets.  While we have successfully implemented this method as part of Predictive SPC, further in-depth discussion is outside the scope of this paper.

The Formula

The formula that we derived is based on the same principles as the Z-score (3): the Z-score tells us how different, in standard deviations, the mean is from the target.  We want to know how different a central measure of the KPM (the median) now is from what it used to be during the baseline time frame.  This difference has to be measured relative to the (5%-95%) data range.

In other words, if, e.g., at a given value of BMI - and on a given day at a given hour - the new KPM median changed by more than 3 times its range, it means that the process has changed.  If it changed by less than 1% of the KPM range, then we can say that the change is not significant.  

However, the range may not necessarily be constant.  To account for that, we propose using the average of the two ranges in the formula for the measure of process consistency.


We propose a W-score for the measurement of consistency of nonstationary processes, defined as follows:

For a given set of (BMI, T, D),

W = 2 (M2 - M1) / (R1 + R2)        (7)
where
W = the process consistency score.
M1, M2 = KPM medians in the baseline (1) and in the latest (2) data sets.
R1, R2 = KPM ranges in the baseline (1) and in the latest (2) data sets.
T = Time of the day.
D = Day of the week.

The ranges are defined as the difference between the 95th and the 5th percentiles of the KPM (the dependent variable that we are monitoring).

The values of BMI at which the W is measured can be preset as a process parameter. The day of the week and time of the day at which the W is measured can be preset as process parameters as well.
It must also be noted that where the range of the dependent variable is the measure of the process quality, we should use the opposite of Eq. (7):

W = 2 (R2 - R1) / (M2 + M1)        (8)

Because we are not using parametric methods, we can directly subtract the ranges.

Note that the R2, R1,  M2, and M1  are all measured as values predicted by the regression at given values of the explanatory variables (in our case, BMI, time of the day, and day of the week).
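A small sketch of Eqs. (7) and (8) follows; the numeric values are made up, standing in for the medians and ranges predicted by the quantile regressions sketched earlier in this post, and the flagging thresholds in the comments come from the discussion above.

```python
def w_score(m1, m2, r1, r2):
    """Eq. (7): consistency score based on the shift of the median
    relative to the average 5th-95th percentile range."""
    return 2.0 * (m2 - m1) / (r1 + r2)

def w_score_range(r1, r2, m1, m2):
    """Eq. (8): used when the range of the dependent variable itself is
    the measure of process quality."""
    return 2.0 * (r2 - r1) / (m2 + m1)

# Illustrative values at one (BMI, time-of-day, day-of-week) point:
m1, r1 = 120.0, 40.0    # baseline median and range (regression predictions)
m2, r2 = 180.0, 48.0    # latest   median and range (regression predictions)

w = w_score(m1, m2, r1, r2)
# e.g., |W| > 3 could flag a changed process, |W| < 0.01 an insignificant change.
print(f"W = {w:+.2f}")   # positive -> degradation, negative -> improvement
```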

Is there another way?

Yes, there is.  The formulae (7) and (8) that we are proposing are very efficient in bringing the SPC problem for non-stationary data into the realm where it becomes similar to problems already being solved by the classic SPC.  If, however, we are interested in finding the deeper patterns, we can use other non-parametric methods.  

In particular, the Kolmogorov-Smirnov (KS) test and the Chi-Square Goodness-of-Fit (GoF) test come in handy if we want to see how much overlap there is between the baseline and the new data distributions, using the p-value to determine to what degree the two distributions still overlap.

The non-predictive nature of these tests can easily be overcome by combining the Predictive SPC methodology with the KS or Chi-Square tests: if we slice the data by a sufficient number of quantile-regression lines and apply the GoF tests at the BMI values of interest, we will significantly improve the power of the method and will be able to answer questions of confidence more reliably.  Specifics of the Kolmogorov-Smirnov test do not allow it to be used with discrete data, as the p-value will only be asymptotic in the case of ties.  There are ways to overcome this circumstance, but they are outside the scope of this paper.  The Chi-Square test, however, does not have this limitation.
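Here is a rough sketch of the two tests applied to a baseline-vs-new KPM slice (synthetic data; the decile binning for the Chi-Square test is our own choice, not prescribed by the method): scipy's two-sample KS test compares the empirical distributions directly, and the Chi-Square test compares binned counts of the new data against expectations derived from the baseline proportions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
baseline_kpm = rng.gamma(shape=4.0, scale=25.0, size=2000)   # KPM slice at a BMI value of interest
new_kpm      = rng.gamma(shape=4.0, scale=32.0, size=2000)   # shifted distribution in the new data

# Two-sample Kolmogorov-Smirnov test: are the two slices drawn from the same distribution?
ks_stat, ks_p = stats.ks_2samp(baseline_kpm, new_kpm)

# Chi-Square goodness of fit: bin on the baseline deciles and compare the new counts
# with the counts expected if the new data followed the baseline distribution.
edges = np.quantile(baseline_kpm, np.linspace(0.0, 1.0, 11))
edges[0]  = min(edges[0],  new_kpm.min()) - 1e-9   # widen the outer bins to catch all new points
edges[-1] = max(edges[-1], new_kpm.max()) + 1e-9
new_counts = np.histogram(new_kpm, bins=edges)[0]
expected   = np.full(10, new_kpm.size / 10.0)       # ~10% of points expected per decile bin
chi_stat, chi_p = stats.chisquare(new_counts, expected)

print(f"KS: D = {ks_stat:.3f}, p = {ks_p:.3g};  Chi-square: X2 = {chi_stat:.1f}, p = {chi_p:.3g}")
```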

Outliers

Outliers (see, e.g., from here to here), from the SPC perspective, are defects: points on the timeline where the process metric did not behave "as predicted".  With stationary data, the standard Tukey method works very robustly:

Find the first (Q1, the 25th percentile) and third (Q3, the 75th percentile) quartiles; compute the interquartile range as IQR = Q3 - Q1; points outside the range from Q1 - 1.5 * IQR to Q3 + 1.5 * IQR are then outliers.

With quantile regression, the same method can be extended to non-stationary data, as long as there is a variable correlated with the process metric whose behavior is known (e.g., time of the day, day of the week, day of the year, or number of business transactions processed, or any other variable):
Find the quantile regression predictions for the first (Q1) and third (Q3) quartiles; compute the interquartile range as IQR = Q3 - Q1; points outside the range from Q1 - 1.5 * IQR to Q3 + 1.5 * IQR are then outliers.
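A rough sketch of this extension follows, on synthetic data with a single explanatory variable for simplicity (see the caution below about the multivariate case): quantile regressions at the 25th and 75th percentiles give conditional Q1 and Q3 values at each point's own BMI, and Tukey's fences are then applied point by point.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"bmi": rng.uniform(10, 100, 1000)})
df["kpm"] = 1.2 * df.bmi + 0.01 * df.bmi**2 + rng.normal(0, 5, len(df))
df.loc[::97, "kpm"] += 60                                   # plant a few artificial outliers

formula = "kpm ~ bmi + I(bmi ** 2)"
q1 = smf.quantreg(formula, df).fit(q=0.25).predict(df)      # conditional first quartile
q3 = smf.quantreg(formula, df).fit(q=0.75).predict(df)      # conditional third quartile
iqr = q3 - q1

# Tukey's fences, evaluated at each point's own BMI value:
df["is_outlier"] = (df.kpm < q1 - 1.5 * iqr) | (df.kpm > q3 + 1.5 * iqr)
print(df.is_outlier.sum(), "outliers flagged out of", len(df))
```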

Proceed with Caution

One important caveat is that, since we are looking at a multivariate quantile regression, things become more complicated: a point that is an outlier with respect to the BMI may not be an outlier with respect to the day of the week or the hour of the day, and vice versa.


We have brought it all together. Stay tuned for Post 5 - Method Illustration!