Predictive SPC. Part 4
Statistical Process Control (SPC) is a well described framework used to identify weak points in any process and predict the probability of failure in it. The distribution parameters of process metrics have been translated into process capability, which evolved in the 1990s into the Six Sigma methodology in a number of incarnations. However, all techniques derived for SPC have two important weaknesses: they assume that the process metric is in a steady state and they assume that the process metric is normally distributed, or can be converted to a normal distribution. The concepts and ideas outlined here make it possible to overcome these two shortcomings. The method has been developed and validated in collaboration with Josep Ferrandiz.
This is the fourth post in the series on Predictive SPC.
Definitions are here.
In Part 1, we covered the Axioms and Assumptions (A&As) of SPC and explained why they are wrong, but useful. In Part 2, we talked about the key concepts and ideas of SPC, and in Part 3 we discussed how those concepts and ideas change when we consider non-stationary processes. Now we are beginning to bring it all together.
Bringing it All Together
We have established:
- That the body of work related to Statistical Process Control (SPC) can very well be used for stationary (steady-state) data.
- That these methodologies can be used to identify outliers, loss of stationarity, and other deviations from what is considered normal data behavior ("normal" in this context does not imply normality of distribution).
- That these methods can not, and should not, be used to identify anomalies in nonstationary data.
Predictive SPC to the Rescue
By “Predictive SPC”, we refer to a wide class of methods that we have developed to gauge variability in nonstationary processes. Our contribution to the development of these methods is in combining the philosophy and methodology of classic SPC with more recent advanced statistical techniques that have been introduced by researchers over the last 25 years and recently implemented in a number of statistical software systems, both general-use applications, from R to SAS and SPSS, and more application-specific packages like Autobox, ForecastPro, Netuitive, and others.
The philosophy behind Predictive SPC
The idea is simple: we know that the data are not stationary, but we want to understand whether this is an artifact of a relationship with another metric whose behavior is known, and if it is, we want to track changes in that relationship.
In this section, we outline the methodology we used successfully for identifying changes in the behavior of a Key Performance Metric (KPM) when it was driven by a single business metric (BMI = Business Metric of Interest). This methodology can easily be extended to multivariate models.
It is important to mention that this methodology works for cases where a consistently predictive model connecting the KPM with the BMI (e.g., the number of queries in the system with the business transaction rate) is known to exist, and we know which is the dependent and which is the independent variable. The goal is to determine whether the variation in the KPM (Key Performance Metric - the Dependent Variable) is related to a change in the BMI (Business Metric of Interest - the Independent, or Explanatory, Variable), and whether this variation is predicted by this relationship.
Problem Statement:
Given:
KPM = traffic in the system
BMI = Transaction Rate (can be any explanatory variable)
Both KPM and BMI have a weekly and daily seasonality.
For illustration purposes, I am using a real enterprise-level use case (with data generated to provide patterns similar to what one normally sees in IT production, to protect sensitive information without losing data relationships), where for ~75% of all entities a polynomial (quadratic) relationship was known a priori to exist between BMI and KPM, as shown in Figure 3. The other entities were known to be independent of the business metric.
Figure 3: Relationship between Performance Metric and Explanatory Variable.
The red, green, and blue lines correspond to the 95th, 50th, and 5th percentiles of the data, respectively (more details below).
Care must be taken when applying polynomial models, as the best-fitted curve for the upper percentiles may peak in the middle and then decline, whereas the best-fitted curve for the lower percentiles may do the opposite. When implementing such models, it is therefore important to have a conditional model substitution for such data. Linear or exponential models may be better predictors in such cases, even if their correlation coefficients are not as good as the polynomial's.
If the model is unknown, it can be established using the methodology described in [Alexander Gilgur, Michael Perka (2006). A Priori Evaluation of Data and Selection of Forecasting Model – presented at the 32nd International Computer Measurement Group (CMG'06) Conference, December 2006, Reno, NV], which finds the best-fitted Least Squares model by using the Fisher-transformed correlation coefficient as the criterion of "best fitted".
Other, application specific, methods can be used to establish the “right” model (yes, we are aware of George E.P. Box’s famous statement that “All models are wrong; some models are useful”). Such methods are outside the scope of this post.
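To make the model-selection criterion concrete, here is a minimal sketch in Python, using synthetic data and hypothetical variable names; it illustrates the idea rather than reproducing the CMG'06 implementation. Each candidate least-squares model is scored by the correlation between its predictions and the observations, put on a comparable scale via the Fisher z-transform.

```python
import numpy as np

# Synthetic BMI (x) and KPM (y) samples -- stand-ins for a baseline window.
rng = np.random.default_rng(0)
x = rng.uniform(10, 100, size=500)
y = 0.02 * x**2 + 1.5 * x + rng.normal(0, 20, size=500)

def fisher_z(r, n):
    # Fisher transform: arctanh(r) is approximately normal with variance
    # 1/(n-3), which makes correlation coefficients comparable across models.
    return np.arctanh(r) * np.sqrt(n - 3)

def score(y_obs, y_hat):
    r = np.corrcoef(y_obs, y_hat)[0, 1]
    return fisher_z(r, len(y_obs))

# Candidate least-squares models; the exponential model is fitted as a
# linear model in log(y), so y is clipped away from zero first.
candidates = {
    "linear":      lambda: np.poly1d(np.polyfit(x, y, 1))(x),
    "quadratic":   lambda: np.poly1d(np.polyfit(x, y, 2))(x),
    "exponential": lambda: np.exp(
        np.poly1d(np.polyfit(x, np.log(np.clip(y, 1e-9, None)), 1))(x)),
}

scores = {name: score(y, fit()) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best-fitted model:", best)
```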
Task:
In order to help developers identify issues in software scalability, establish a process to detect changes in the behavior of Concurrency (the KPM).
Solution Plan:
- Build regressions based on the data in a baseline time range and the latest time range. Since the data vary weekly, there should be at least two weeks' worth of data in each data set, in order to be able to correct for weekly variation.
- For a given set of BMIs (explanatory variables), compute the model predictions using the models built for each of the data sets.
- Compare the model predictions at the BMI values of interest (see the sketch after this list).
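A minimal sketch of these three steps, using synthetic data, hypothetical column names ('bmi', 'kpm'), and a plain quadratic least-squares fit per window (the caveats below explain why the actual implementation uses quantile regression instead):

```python
import numpy as np
import pandas as pd

def fit_quadratic(df):
    # Stand-in for whatever model was selected for the entity.
    return np.poly1d(np.polyfit(df["bmi"], df["kpm"], 2))

def compare_windows(data, baseline, latest, bmi_of_interest):
    # Fit one model per window (each window >= 2 weeks, so weekly variation
    # is averaged out) and report the relative change in predicted KPM.
    base_model = fit_quadratic(data.loc[baseline])
    last_model = fit_quadratic(data.loc[latest])
    pred_base = base_model(bmi_of_interest)
    pred_last = last_model(bmi_of_interest)
    return pd.DataFrame({
        "bmi": bmi_of_interest,
        "kpm_baseline": pred_base,
        "kpm_latest": pred_last,
        "relative_change": (pred_last - pred_base) / pred_base,
    })

# Synthetic demo: four weeks of hourly data; the last two weeks drift upward.
idx = pd.date_range("2013-01-01", periods=24 * 28, freq="H")
rng = np.random.default_rng(1)
bmi = rng.uniform(10, 100, len(idx))
kpm = 0.02 * bmi**2 + 1.5 * bmi + rng.normal(0, 10, len(idx))
kpm[idx >= "2013-01-15"] *= 1.2   # behavior change in the latest window
data = pd.DataFrame({"bmi": bmi, "kpm": kpm}, index=idx)

print(compare_windows(data,
                      baseline=slice("2013-01-01", "2013-01-14"),
                      latest=slice("2013-01-15", "2013-01-28"),
                      bmi_of_interest=np.array([25.0, 50.0, 90.0])))
```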
Caveats:
- Nonlinear nature of the relationship between the dependent and independent variables.
- Data are seasonal (daily and weekly), but also the relationship between the BMI and the KPM may behave differently at different times of the day and on different days of the week; in other words, the regression model has to be 3-dimensional (see the sketch after these caveats):

  KPM = f(BMI(t, d), t, d)    (6)

  where t = time of the day, and d = day of the week.
- Data cannot be assumed to be normally distributed (e.g., if BMI is transaction rate, it is a Poisson-like variable, and if KPM is the total number of queries in the system, then the KPM is an Erlang process: according to Little's Law, total number of queries (concurrency) is a product of a Poisson distribution of request arrivals and an exponential distribution of request processing times).
A more in-depth process complexity analysis is given in, e.g., our paper presented at CMG'12 [Alex Gilgur, Josep Ferrandiz, Matthew Beason (2012). Time-Series Analysis: Forecasting + Regression: And or Or? – presented at the 38th International Computer Measurement Group (CMG'12) Conference, December 2012, Las Vegas, NV].
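As an illustration of the second caveat, here is a minimal sketch of one way to encode Eq. (6): BMI enters as a quadratic regressor, while time of day and day of week enter as categorical terms. The column names and the use of an ordinary least-squares fit are assumptions made for brevity; the same formula carries over to the quantile regressions discussed below.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic hourly data; 'bmi' and 'kpm' are hypothetical column names.
idx = pd.date_range("2013-01-01", periods=24 * 28, freq="H")
rng = np.random.default_rng(2)
df = pd.DataFrame({"bmi": rng.uniform(10, 100, len(idx))}, index=idx)
df["hour"] = df.index.hour          # t = time of day
df["dow"] = df.index.dayofweek      # d = day of week
df["kpm"] = (0.02 * df["bmi"]**2 + 2 * df["hour"] + 5 * df["dow"]
             + rng.normal(0, 10, len(idx)))

# KPM = f(BMI(t, d), t, d): the hour-of-day and day-of-week terms let the
# fitted relationship shift by time of day and by weekday.
model = smf.ols("kpm ~ bmi + I(bmi ** 2) + C(hour) + C(dow)", data=df).fit()
print(model.params.head(10))
```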
Solution to the Caveats:
- Use quantile regression: because of the skewed (non-Gaussian) distributions of the two metrics in our case, we know that the upper (right) tail of the KPM distribution will behave differently from the lower (left) tail. Quantile regression captures this difference in behavior.
- Derive a formula that is a non-parametric version of the same formulae that are used in "classic" SPC.
Quantile regression offers a self-consistent and outlier-independent way of tracing the behavior of, e.g., the 95th percentile, the 50th percentile (median), and the 5th percentile of the data. Other percentiles can be used as well. Thus, if we use quantile regression for the 25th and the 75th percentiles of the data, we can easily construct a system for detecting outliers in the dependent variable (KPM = Concurrency), as sketched below. With outliers identified, we can then derive a stationary outlier-tracking variable (number and magnitude) that allows us to easily compare the baseline and the new data sets. While we have successfully implemented this method as part of Predictive SPC, a further in-depth discussion is outside the scope of this post.
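A minimal sketch of the quantile-regression piece, using statsmodels and synthetic data. The column names, the quadratic formula, and the Tukey-style 1.5 × IQR fence for flagging outliers are assumptions made for illustration, not necessarily the exact rules used in our implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, skewed data: 'bmi' drives 'kpm' with non-Gaussian noise.
rng = np.random.default_rng(3)
bmi = rng.uniform(10, 100, 2000)
kpm = 0.02 * bmi**2 + 1.5 * bmi + rng.gamma(2.0, 10.0, 2000)
df = pd.DataFrame({"bmi": bmi, "kpm": kpm})

# One quantile regression per percentile of interest; each quantile gets its
# own curve, so the upper and lower tails are free to behave differently.
model = smf.quantreg("kpm ~ bmi + I(bmi ** 2)", data=df)
fits = {q: model.fit(q=q) for q in (0.05, 0.25, 0.50, 0.75, 0.95)}
pred = pd.DataFrame({q: fits[q].predict(df) for q in fits})

# Outlier detection from the conditional 25th/75th percentiles
# (Tukey-style fence -- an assumption for this sketch).
iqr = pred[0.75] - pred[0.25]
lower = pred[0.25] - 1.5 * iqr
upper = pred[0.75] + 1.5 * iqr
outliers = (df["kpm"] < lower) | (df["kpm"] > upper)

# A stationary tracking variable: count and total magnitude of excursions,
# directly comparable between the baseline and the latest data sets.
magnitude = (df.loc[outliers, "kpm"] - pred.loc[outliers, 0.50]).abs().sum()
print("outliers:", int(outliers.sum()),
      "total magnitude:", round(float(magnitude), 1))
```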