Tuesday, July 6, 2010

More on Monte-Carlo simulations and scalability

In one of my earlier posts (May 22, 2010), I talked about using Monte-Carlo simulation for evaluating system capacity using the Universal Scalability Law. This post is an extension of that train of thought.

A very important and convenient feature of the USL model is that its parameters α and β can be evaluated by running a controlled experiment: passing a fixed workload through the system at different numbers of parallel processors and measuring the time the system takes to process it.


(Disclaimer: All charts, diagrams, and values in this post are not based on actual collected data and are provided for illustrative purposes only. All the data were generated and processed using the R programming language.)

When we run a controlled experiment, we have a fixed number of processors and a fixed set of tasks, and we measure the throughput. Everything is fixed, and we can quickly calculate the parameters.


However, when we are in production mode, we may not have such a luxury. That will definitely have a negative effect on our ability to get α and β as deterministic numbers.


We are measuring from the production system whatever we can get, and as a result, we get a distribution of possible αs and a distribution of possible βs of the Universal Scalability Law model.


In this case, we can measure the throughput of a group of processors (a server, or a server pool) over time, recording both the throughput and the number of processors in operation in that group at the moment each measurement was taken.


Then, from these data, we can estimate the α and β parameters, using, e.g., the least-squares method (modified to fit the USL formula), Maximum-Likelihood Estimation, or any other suitable approach. These stochastic methods allow us to estimate the significance of the model parameters and, more importantly, their distributions.
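
As an illustration, here is a minimal R sketch of such an estimation on synthetic data, using nonlinear least squares as one flavor of "least squares modified to fit the formula"; every number in it is invented:

# A minimal sketch (on synthetic data) of recovering the USL parameters alpha
# and beta from noisy throughput measurements with nonlinear least squares.
set.seed(42)

usl <- function(p, a, b) p / (1 + a * (p - 1) + b * p * (p - 1))

# Hypothetical "production" observations: processor counts in service and the
# throughput measured at those moments (with multiplicative noise).
p_obs   <- sample(4:40, 200, replace = TRUE)
thr_obs <- usl(p_obs, a = 0.02, b = 5e-4) * rnorm(length(p_obs), mean = 1, sd = 0.05)

# Fit the USL formula directly; summary() reports the estimates and their
# standard errors, which is what defines the alpha and beta distributions.
fit <- nls(thr_obs ~ p_obs / (1 + a * (p_obs - 1) + b * p_obs * (p_obs - 1)),
           start = list(a = 0.01, b = 1e-3))
summary(fit)$coefficients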




Once we have these parameters (as probabilistic values) and the number of processors (as a random value with a uniform? distribution), we can run the Monte-Carlo simulation and identify the likeliest values of the system throughput, as well as potential pitfalls.
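
Here is a sketch of that step in R, assuming normal distributions for α and β (means and spreads invented, as if taken from a fit like the one above) and a uniform distribution for the number of processors in operation:

set.seed(1)
n_sim <- 1e5
alpha <- rnorm(n_sim, mean = 0.02, sd = 0.004)   # assumed spread of alpha
beta  <- rnorm(n_sim, mean = 5e-4, sd = 1e-4)    # assumed spread of beta
p     <- sample(4:40, n_sim, replace = TRUE)     # uniform over the 4-40 range

thr <- p / (1 + alpha * (p - 1) + beta * p * (p - 1))
quantile(thr, c(0.05, 0.50, 0.95))               # likeliest values and the tails
hist(thr, breaks = 100, main = "Simulated USL throughput", xlab = "C(p)")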


Generally, a regression cannot be extended beyond the range over which it was built. But the universal part of the Universal Scalability Law implies that it can be used with the same parameters for systems whose size and complexity extend beyond the system on which these parameters were calculated.


Thus, keeping the same α and β distributions, and running a Monte-Carlo simulation on a number of processors P extended beyond the original 4-40 range to a range of 32-400 gives the following USL curve:


By running the Monte-Carlo simulation, we can also get the distribution of the Throughput. In the particular example considered here, we have a distribution very strongly skewed to the right (higher end of the throughputs).

We also see that the throughput will have a range of achievable values, which is determined by the α and β distributions calculated above, as well as the distribution of the number of processors running simultaneously on the system under consideration.

It is interesting to look at the throughput curves under various distributions assumed for the number of processors running. The plots above assumed a uniform distribution, meaning that there is a range of processor counts and, at any time, each count in that range has an equal chance of being operational; in other words, the load balancers are working properly.


Alternatively, we may consider a binomial model, wherein a given fraction of the pool of processors is operational at any time. It is a discrete counterpart of the normal distribution, and it is intuitively obvious that the system will behave differently under this model.


Monte-Carlo simulation shows that, indeed, for a binomial distribution of the number of processors running in the group, USL describes a very different behavior:





Here we have 400 processors, of which 25% (100 processors) are operational at any time. All other parameters are the same as in the uniform case analyzed above.
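
A sketch of this scenario in R (the α and β spreads are assumed, as before):

set.seed(2)
n_sim <- 1e5
alpha <- rnorm(n_sim, mean = 0.02, sd = 0.004)
beta  <- rnorm(n_sim, mean = 5e-4, sd = 1e-4)
p     <- rbinom(n_sim, size = 400, prob = 0.25)   # ~100 processors up on average

thr <- p / (1 + alpha * (p - 1) + beta * p * (p - 1))
hist(thr, breaks = 100, main = "Throughput, binomial pool (400 at 25%)", xlab = "C(p)")

# Swapping in rbinom(n_sim, size = 110, prob = 0.9) reproduces the second
# scenario discussed below (110 processors, 90% operational).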


Interestingly enough, replacing that with 110 processors, of which 90% are operational at any time (again roughly 100 operational processors), gives yet another, different behavior:


It is important to understand that, even if the data do not seem to support the Universal Scalability Law model, it may only mean that the distribution of the number of servers in operation is non-uniform: the pictures above were obtained using the USL equation itself.


These last two plots do not mean that it is very difficult to get the USL model parameters from data obtained under a non-uniform distribution of the number of servers in operation: we assume that the uncertainty in the α and β values comes from our not knowing the exact numbers, not from any inherent indeterminacy of the contention and coherency terms.


In the case of a binomial distribution of the parallel servers running the task, we get a nearly normal distribution of throughput (compare it with the earlier throughput histogram for the uniform model of running processors, shown on the right).


It seems reasonable to theorize that the shape of the distribution of a multi-processor system's throughput is determined by the way the system is operated, and that, by looking at the historical histogram of throughput for a system, we can determine the model describing the system's load-balancing operation.

As a side note, if the α and β distributions are not normal, we will have a different shape of throughput distributions.

Thursday, June 17, 2010

Poisson, the moon, seasonal changes, and forecasting

A very special case of a non-normal distribution is the Poisson distribution.  I call it special because it can be used in analyzing data for seasonality (any periodic pattern is, by convention, called "seasonal"; the reasons why are hidden behind a shroud of mystery. I honestly tried to understand them, but failed, so I just joined the crowd that says seasonal patterns don't have to follow the annual cycle, and left it at that.)

In the ideal world, if there is a 7-day (or a 3-month, or a 4-week) pattern in the data, we can count on this pattern repeating consistently.  Moon phases (and consequently the tidal  movements of water) would be a good example of that.  For cases like this, Fourier analysis works wonders, allowing us to analyze the data and pick out all the precise seasonalities and then sort them in the order of magnitude (amplitude) and pick the most significant ones (more on significance later in this post).
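
As a quick illustration of how cleanly Fourier analysis recovers an exact periodicity, here is a toy R example on a synthetic daily series with a 7-day cycle:

set.seed(3)
t <- 1:365
x <- 10 + 3 * sin(2 * pi * t / 7) + rnorm(length(t))

sp <- spec.pgram(x, taper = 0, plot = FALSE)   # raw periodogram (base R)
1 / sp$freq[which.max(sp$spec)]                # dominant period, close to 7 days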

But unless we are in a business related to open-sea fishery, offshore oil drilling, lighthouse operations, or any other where lunar phases are important, most of  the time we do not have such a clear-cut seasonal pattern: a maintenance event may be off by a day; periodic deliveries may not always happen on time; task processing may take longer than it usually does; etc.

In these cases, we see outliers (events) following an approximately periodic pattern, with clusters of data between the outliers.  We can view this problem from two sides:



1.   We can analyze the intervals (the sizes of these clusters) and, if they follow the Erlang distribution, we may be able to predict the amount of traffic based on queueing theory. The Erlang distribution, however, is not very simple to handle in data analysis (albeit very useful), as the correct formula depends on what is done with the blocked calls (packets): they may be aborted or queued.  For more on the Erlang distribution, see http://en.wikipedia.org/wiki/Erlang_distribution


In addition, not all situations can be reduced to a queueing theory application, and sometimes a more generic approach is needed.

2.   We can look at the problem from the other end and sort the cluster sizes, with the assumption that some fuzziness is expected: e.g., a cluster size of 5-8 days is weekly, whereas 26-35 days is monthly, etc.  That done, we can analyze the numbers of outliers (events) that occurred during these intervals, and if, for the most 'prominent' period (the mode of the cluster-size distribution), the event counts follow the Poisson distribution, then we have a seasonality.  This seasonality can then be used as a parameter in forecasting the data.
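
A small sketch of that check in R, on invented counts of events per candidate "weekly" interval, using a simple dispersion (variance-to-mean) test for the Poisson assumption:

events_per_week <- c(2, 1, 3, 0, 2, 2, 1, 4, 1, 2, 3, 1, 0, 2, 2, 1)

n          <- length(events_per_week)
lambda_hat <- mean(events_per_week)
dispersion <- (n - 1) * var(events_per_week) / lambda_hat   # ~ chi-square with n-1 df under Poisson
p_value    <- pchisq(dispersion, df = n - 1, lower.tail = FALSE)

c(lambda = lambda_hat, dispersion = dispersion, p.value = p_value)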

Wednesday, June 2, 2010

"Abnormal" distributions: just a reminder

The Central Limit Theorem (CLT) is very well known: no matter what the distribution of the population, the means of random samples taken from it will be approximately normally distributed, provided the sample size is "big enough".  A lot has been done in statistics to describe the normal distribution and to use it: most notably, the Z and T tests, the F test, and the entire ANOVA; even the binomial distribution of discrete outcomes (success/failure) of the Bernoulli-trial variety has been proven to be well approximated by the normal distribution, which allows us to use the Z (T) tests to judge improvement or degradation in reliability presented as Defects per Million Opportunities, aka DPMO.

But is the Normal distribution really that important?  What if we are sampling data from a process with a random component, and we know that there is no way that we can guarantee the randomness of the samples? 

It is great when we are running a Hadoop-type application, with millions of data rows coming in.  We can sample all we want in any way, shape, or form, and force the distribution to be Normal by fulfilling the CLT conditions.

But what if we can only measure once a day?  Then the conditions of the CLT no longer hold.  Does that mean that our monthly distribution of samples cannot be judged as Normal?  What can we do to use statistical analysis on such data?

One way around the problem is to group the data into clusters and then to take random data points out of the clusters.  The formality is preserved, but the quality of the data suffers.  If we have 28 data points (1 for each day) and we cluster them into weekly groups, we end up with a group of 4 samples, and it doesn't really matter that the means of these 4 samples follow the Normal distribution.  To make things even more desperate, who is to say that these 4 samples will correspond to the weekly data variations?  What if we are trying to track weekend variations, but we start the 28-day group on a Wednesday?  We have just given away 1 (out of only 4!) seven-day sample by splitting it into Wednesday-Saturday and Sunday-Tuesday periods.

To make the long story short, we cannot always have a Normal distribution.  Then how do we make the judgment calls on process improvement, on capacity sufficiency, on difference in processes?

Parenthetically, the funny thing about the T-test is that Student's T is NOT Normal, but at large sample sizes (high numbers of degrees of freedom) T can be approximated by the Z distribution, which is Normal. Yet it is the T-test, NOT a Z-test, that is to be used when comparing two means or comparing a mean with a value. Then why are we so worried about Normality when checking the equality of means? Because at DF > 30 (DF = degrees of freedom) we have to use the Z-table. So we say that the assumption is that the samples are drawn from a normal distribution.

Sadly, we rarely check that this assumption is true, especially when performing time-series analyses, and we rarely Normalize the variable we are looking at.  For example, if we are dealing with Rho, Pearson's correlation coefficient, we can use Fisher's transformation to make it approximately Normal and then use the Normalized values to decide how well the model fits the data (one-sample T-test) or how different two models are from each other, and which is better (two-sample T-test).
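
For example, a minimal sketch of Fisher's transformation in R (the correlation and sample size are hypothetical):

r <- 0.62
n <- 40

z    <- atanh(r)                      # Fisher's z = 0.5 * log((1 + r) / (1 - r))
se   <- 1 / sqrt(n - 3)               # approximate standard error of z
ci_z <- z + c(-1, 1) * qnorm(0.975) * se
tanh(ci_z)                            # back-transformed 95% CI for Rho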

For the "typical" analyses, simpler Normalizing transformations (power transformations, roots, etc) can be used. 

Special care must be exercised when going from values to their derivatives (and vice versa): in the general case, Normality does not hold under such transformations.  Most importantly, a mixture of two normal distributions is NOT normal unless their means (and variances) are equal, and the product of two normal random variables is generally not normal either (the sum of two independent normal variables, on the other hand, is).

One special distribution is the Exponential distribution. It is heavily used in queueing theory, because it naturally describes Poisson processes and Markov chains, and in reliability analysis, because the time between failures is Exponentially distributed, with the MTBF (mean time between failures) as its mean, and the Reliability function is defined through the MTBF; as a result, we can use the Chi-Square distribution to evaluate an improvement or degradation in reliability if we know the change in MTBF.
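
A small sketch of that Chi-Square evaluation in R, for a failure-terminated test with invented numbers:

r_fail   <- 12                        # observed failures
T_total  <- 6000                      # total accumulated operating hours
mtbf_hat <- T_total / r_fail

# Chi-square confidence interval for the MTBF (2*T/MTBF ~ chi-square, 2r df)
ci <- c(lower = 2 * T_total / qchisq(0.975, df = 2 * r_fail),
        upper = 2 * T_total / qchisq(0.025, df = 2 * r_fail))
c(MTBF = mtbf_hat, ci)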

As a rule, do NOT Normalize Exponentially-distributed data: this distribution is too important for that!

For a great outline of Normal and non-Normal distributions and tests, follow this link:
http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf

Tuesday, May 25, 2010

More on Monte-Carlo simulation: if we don't know the equation

In my previous post, I was discussing the beauty of using Monte-Carlo simulation when we can describe the world by a formula.  Sadly, more often than not, we can't.  The beauty of Monte-Carlo analysis is in its suitability for such cases.  If we can describe the world as a set of nodes and links (aka states and events), and if we know the probability distributions of the states, or the timing and magnitude of the events, and we want to know the most likely outcome, we do not have to use an analytical model: the node-and-link grid has all the solutions in it.  Some of these solutions are expressed as an equation or a set of simultaneous equations (again); others, as a parsing algorithm (akin to the parsing algorithm for an N-ary tree).

Any node-and-link model involves activities (akin to decision making) on nodes, connected by links. In some cases, such links will be physical; e.g., in a traffic simulation, highways will be represented by links, and nodes will correspond to intersections, stores, towns, etc.  Traffic density, delays, link traversal times, and link and node capacities would be the independent variables for the "provider" nodes and links and the dependent variables for the "receiver" nodes and links.

In other cases, links are logical, belonging to the "IF...THEN...ELSE" category.  Probability distributions of decisions at each node will be the metrics of interest in this case, along with the other parameters, like retention time at the node, link delay, etc.


A Closer Look at the Node-and-Link Model Structure from the Monte-Carlo Standpoint.

A fragment of a traffic network represented by nodes and links is shown in Figure 1 (left).  "Vehicles" (anything from cars to messages) traverse the nodes via links, each node and link having their intrinsic properties, like retention time; capacity; throughput, reliability, etc.







A traffic network model can just as well simulate a decision grid by introducing a virtual “crawler” parsing the network.  By sending multiple crawlers down the network and simulating each crawler's retention times and decisions at each node, we will build the total distribution of the network latency.

The diagram on the right illustrates the concept.  Nodes are decision-making points, where the crawlers are detained for time periods described by known distributions (see the histograms).  At decision nodes (e.g., N2_2 in this diagram), each outgoing link is associated with a given frequency of selection (the binary histogram next to it).
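
A sketch of this idea in R, with all retention-time distributions and link-selection frequencies invented for illustration:

set.seed(5)
n_crawlers <- 1e4

t_n1    <- rlnorm(n_crawlers, meanlog = 0.5, sdlog = 0.3)   # retention at an upstream node
t_n2_2  <- rgamma(n_crawlers, shape = 2, rate = 1)          # retention at decision node N2_2
branch  <- rbinom(n_crawlers, size = 1, prob = 0.7)         # link selection frequency at N2_2
t_link  <- ifelse(branch == 1, rexp(n_crawlers, rate = 1/2),
                               rexp(n_crawlers, rate = 1/5))

latency <- t_n1 + t_n2_2 + t_link
hist(latency, breaks = 100, main = "Total network latency across crawlers")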


The beauty of Monte-Carlo simulation is that it can do fast-time modeling, simulating events faster than they happen in the model's real-world counterpart.

This can be achieved in a number of ways, the most popular of which are:

1.  Real-time: when running human-in-the-loop simulations, with actual humans making some of the decisions, as well as in psychology and social-science research, real-time Monte-Carlo simulation is irreplaceable, as it allows subjects to make decisions in the actual tempo of the world they are used to.  It is also very popular in the game business, with SimCity and Earthquake being among the earliest adopters of this approach, where the user is in control of the simulated world.

Interactive online games are the next step in the evolution of the real-time Monte-Carlo simulation approach, where users are in control, and all the hundreds (thousands?) of users, by virtue of playing the game, become the simulation.  The downside of using an interactive-game environment in any research is that the number of users varies, which will likely make any data collected from such a setup too noisy to be useful.

2.  Time-scale: the simulated time delays are actually waited out, on a scale, by the modeling computer.  This method allows building very high-fidelity event-and-process models; however, it may not be very efficient for modeling situations that take a long time to complete without much happening: e.g., a simulation of a day of operations of a police station in a quiet neighborhood of Beverly Hills, CA would be too boring to analyze, and it would take far too long for any analyst's liking, even at a scale of 1:24, which lets a day of simulated time pass in 1 hour of the simulation's runtime.  Another drawback of time-scale simulations is their dependency on the simulator running continuously: a stoppage of the program's software or hardware will cause a disconnect between simulated time and real time, which time-scale modeling developers need to be aware of and account for in their disaster-recovery and exception-handling practices.

3.  Fast-time: this is yet another change of paradigm with regard to what the nodes and links represent.  (See illustration below)


Each crawler (aka Agent) knows what the timestamp of its next event will be.  The event may be any change of the crawler's current state (spatial location, heading on the compass, number of friends on Facebook, job status, citizenship, oil pressure, speed, funding status, etc).

A master controller keeps the simulation time and a collection of agents sorted in the order of their next event times.  All agents act independently or interact with each other via the rules defined for them in the simulation program or configuration files.  Each agent exposes only the functionality and fields that other agents need to be aware of.  In addition, each agent may keep a list of agents that need to be notified of its changes, in an Observer type of design pattern.
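
A toy fast-time loop in R, in the spirit of the description above (the event-time distributions are invented; a real simulator would keep a proper priority queue and richer agent state):

set.seed(4)
n_agents   <- 5
next_event <- rexp(n_agents, rate = 1)   # each agent's next event time
clock      <- 0
horizon    <- 10                         # simulated-time horizon

while (min(next_event) < horizon) {
  i     <- which.min(next_event)         # agent with the earliest event
  clock <- next_event[i]                 # advance simulated time in one jump
  # ... apply agent i's state change and notify its observers here ...
  next_event[i] <- clock + rexp(1, rate = 1)   # schedule its next event
}
clock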

The inherent flexibility of this kind of simulation, together with its ability to run faster than real time while at the same time making more productive use of the hardware than time-scale or real-time simulation does, makes it the key to efficient discrete-event simulation and the paradigm of choice for many analyses.

Saturday, May 22, 2010

More on Monte-Carlo simulation: If we know the equation.

There are lots of examples of using Monte-Carlo analysis in situations when we have the relationship between the independent variables (factors) and the dependent variable(s).

For example, let's consider a performance-risk-analysis problem for a given multiprocessor system, and suppose we know that this system is described by the universal scalability model derived by Dr. Neil Gunther. (The latter is a very reasonable assumption, considering that the USM is known to be perhaps the most valid representation of multi-processor capacity.) Suppose, further, that we have been able to get enough throughput measurements to derive the parameters (α and β) of this system's model:

C(p) = p / (1 + α (p − 1) + β p (p − 1))     (1)

(Neil, thank you for the HTML !)

In this formula, C is the throughput, or capacity, of the system; α is the contention component; β is the coherency component; and p is the number of processors.

We are asked to predict whether the system will be able to handle the workload given a probability distribution of p = the number of processors operational on this system at any given moment in time (it may just as well be switches, routers, controllers, etc.).

In mathematical terms, we are asked to determine the distribution of the system's throughput (capacity) if the number of processors follows a certain probability distribution.


(This is inserted on July 5, 2010:) In addition, when we are calculating the model parameters α and β, we can only do so with some degree of uncertainty.  That implies a distribution of possible values for the parameters, with the calculated numbers being, hopefully, the mean and mode of these distributions.  In the best-case scenario, we can assume a normal distribution with a standard deviation close to zero, in which case the only probabilistic term left is the number of processors in operation.
More in a blog due next week (hopefully)
(End of July 5, 2010 insertion)

We have the formula and the distribution for p; we vary p by rolling the dice from this distribution and calculate the corresponding values of the system's capacity, C(p). The more times we roll the dice, the more closely the resulting data will resemble the distribution of the capacity for this system.
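
In R, the whole exercise fits in a few lines; the α, β, and the p-distribution below are assumed purely for illustration:

set.seed(6)
alpha <- 0.02
beta  <- 5e-4
p     <- rbinom(5e4, size = 48, prob = 0.9)    # example distribution of live processors

Cp <- p / (1 + alpha * (p - 1) + beta * p * (p - 1))
mean(Cp >= 15.5)                               # share of draws meeting a hypothetical demand
hist(Cp, breaks = 60, main = "Distribution of C(p)")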

Parenthetically: generally speaking, when applying Monte-Carlo or any other analytical technique, care must be taken that the distribution of p stays reasonably within the range on which the parameters were determined. In the case of the USM, however, this caveat may be considered of low relevance.

We then run statistical tests to determine whether the C(p) that we calculated will meet the demands for the system's capacity.

What tests to run, and how to interpret the results, is the subject of another post.

Thursday, May 20, 2010

Discrete-event simulation and capacity planning

When we have historical time-series data and want to plan the system's capacity (and it doesn't matter if we are talking about a storage, network, airspace, or classroom system; any non-isolated system that needs a way to store data, signals, students, airplanes, etc.), we know what to do: see my previous post.

In this case, we analyze the data, generate forecasts, and analyze the forecasts with reference to installed capacity to determine the forecasted system resource utilization, which will lead us to identifying capacity needs for the future.

Failure Analysis

There is, however, a class of problems where we want to predict the system's capability to perform without error, but errors are rare events that have no right to happen. Traffic safety comes to mind as an example of a problem of this class. We may observe such a system till we turn blue in the face and notice no failures. Does it mean that the system is safe?

The Nature of System Failures

By failure, I refer to any instance of the system not being adequate to the task or not performing as specified. It can be a defect rejected from a production line, or a gridlock on a network, or a crash. A process parameter outside the control limits, too, can be treated as a failure.

Failures are (hopefully) rare events distributed according to the Poisson distribution, which describes the frequency of rare events in a given time period, group of entities, etc. For these, time-series analysis is not the right tool. A discrete-event simulation is more adequate to the task.

About Discrete-Event Simulation:


Any discrete-event simulation (aka DES) operates on the principle of abrupt transitions from state to state. An event is merely another name for such a transition. It is characterized by a timestamp, the previous state, and the new state.


When the simulation has processed the System State Before Event (I will call it SSBE here), it immediately advances the simulation clock to the timestamp of the System State After Event (SSAE). Of course, in the real world the real-time clock keeps ticking, and there is some time-scaling going on, as opposed to a truly discrete transition; but for the purposes of the simulation, this approach allows us to model real-world processes faster than they really happen. That's why the other name for discrete-event simulation is Fast-Time simulation.

Validation and Calibration:
However, it is critically important that the model be properly validated, based on the actual historical data, for reasons I am going to talk about later.

Once a model is built for the simulation, it is run in the Monte-Carlo manner, repeating the runs multiple times and analyzing the outcomes for deviation from the historical data. A regular T-test (Z-test) can be used for each of the critical parameters of the model. If any parameters have not been validated, the model may have to be modified, and then the entire cycle will have to be done again.

This is the basic validation process.

Its outcome, a properly validated simulation, can be considered benchmarked for the base use case.

In the ideal world, we have data for multiple use cases. If so, we can not only validate the model, but also calibrate it against a range of conditions.


Monte-Carlo and failure analysis

Now, if we have ranges and distributions for the independent variables, we can vary them randomly ("rolling the dice") and run the simulation, obtaining the dependent variables. This process is called (surprise!) Monte-Carlo simulation. Assuming that the data we validated the model against adequately represent the system, we can then, based on the outcomes of the model, predict how the real system will likely behave. All system failures, no matter how unlikely, will inevitably appear given a sufficiently large number of iterations.
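
In R terms, the failure-analysis step might look like the sketch below, with run_model() as a hypothetical stand-in for the actual validated simulation:

set.seed(7)
run_model <- function() {
  # toy stand-in: "failure" when a Poisson-distributed error count exceeds 3
  rpois(1, lambda = 0.2) > 3
}

n_runs   <- 1e5
failures <- sum(replicate(n_runs, run_model()))
binom.test(failures, n_runs)$conf.int          # interval estimate of the failure rate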


Monte-Carlo and Robust Experimentation
If we have been able to calibrate the model as well, then we can run all kinds of robust experiment designs, covering the range of independent-variable values that corresponds to the conditions at which the model was calibrated. The beauty of it is that the experiments are conducted on the model, without the need to take the real system out of operation for the duration of the experiments.

More on design of experiment, robust design, etc. is coming in one of my next posts.

Wednesday, May 19, 2010

About capacity planning

On March 31, ComputerWorld ran an article "How to develop an effective capacity planning process"

(http://www.computerworld.com/s/article/9174422/How_to_develop_an_effective_capacity_planning_process)
Excerpt:
* 1. Select an appropriate capacity planning process owner.
* 2. Identify the key resources to be measured.
* 3. Measure the utilizations or performance of the resources.
* 4. Compare utilizations to maximum capacities.
* 5. Collect workload forecasts from developers and users.
* 6. Transform workload forecasts into IT resource requirements.
* 7. Map requirements onto existing utilizations.
* 8. Predict when the shop will be out of capacity.
* 9. Update forecasts and utilizations.
"

It's all very true, but only in the ideal world. In particular, Item 5 is not very likely to happen in the real world, for a number of reasons.

First, workload forecasts are normally derived from the overall amount of work in the corporation: private information usually kept close to the vest by the corporate business wizards, since knowledge of such information is classified as insider information.

Second, developers and other IT users will at best have an understanding of what such information is being collected for, but it sits on the farthest back burner of their minds and will typically be forgotten right after you talk to them about it: they have more pressing issues to worry about, and developers want to optimize the resources they already have, anyway.

Third, transforming workload prediction information into capacity utilization is an ill-defined problem, the only universal answer to which is "it depends".

Even such a seemingly trivial problem as finding the correlation between corporate headcount and utilization of Exchange server capacity (storage, performance, or both) will never follow a single formula; it will depend on a number of conditions, from Exchange server configuration and allocation to corporate policies, etc.

When the company is using a SAN cloud, MapReduce in any of its incarnations, or any other distributed system, such transformation of developer workload forecasts into corporate IT utilization speculations very quickly degenerates into an exercise in Professor McGonagall's art of transfiguration and results in what Scott Armstrong called, "Bold Freehand Extrapolation".

Such forecasts cannot really be used to do serious data-driven capacity planning.

The right approach would be to collect the actual usage data from the IT network and, once a sufficient amount of data has been collected, to calculate a Time-Series-Analysis (TSA) forecast for each piece of hardware for which usage data have been collected.
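
A minimal sketch of that TSA step in R, using base Holt-Winters smoothing on hypothetical daily utilization data with a weekly cycle (any suitable forecasting tool can be substituted):

set.seed(8)
days <- 1:140
util <- 50 + 0.1 * days + 5 * sin(2 * pi * days / 7) + rnorm(length(days), sd = 2)
u_ts <- ts(util, frequency = 7)

fit <- HoltWinters(u_ts)
fc  <- predict(fit, n.ahead = 28, prediction.interval = TRUE)   # four weeks ahead
head(fc)   # point forecast plus upper/lower bounds: the uncertainty mentioned in point 3 below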

This approach is characterized by 5 very important features:

1. It reduces the guesswork to the bare minimum and almost makes it non-arbitrary: forecasts become data-driven.
2. Assuming the right tools have been used for forecasting, it produces accurate and reliable forecasts.
3. It makes its user aware of the uncertainty inherent in forecasting: when we say that we will use X amount of resource, we choose to ignore the uncertainty; when we use a statistical tool to come up with the X, the uncertainty comes up, and we have to deal with it - which is an inconvenience, but a good thing: it keeps us and our partners honest.
4. It gives us a way to judge the quality of the forecast.
5. It lends itself easily to multi-level forecasting, allowing us to build up forecasts for higher levels of integration from the lower levels of granularity.

All of this together is a very convincing reason why we should not rely on users' and developers' information in capacity planning, but instead use real forecasts.

Monday, May 17, 2010

How it all started: Part 1 / 4 (for the remaining 3 parts, click Comments)

Allow me to introduce myself: AlexG. I am originally from Moscow, Russia, and have been living in California (the Bay Area) for the last 19 years. I have a family, kids, and a great interest in life and the many different things it offers.

I've got an excellent graduate-level education in Applied Math, Control System Engineering, and Operations Research, among other things. This education carried me along into getting a professional engineer (PE) license and into Semiconductor Manufacturing, Six Sigma, and real Data Analytics, creating software for Data Mining, Fast-Time Simulations, and Statistical Forecasting.

Along the way, I generated papers and patent applications in every domain I touched, from waste treatment to data mining.

I joined EG&G Amorphous Silicon (which is now Perkin-Elmer Optoelectronics) in 1995 to be a Process Engineer in a semiconductor production startup environment. A couple of years later, I was picked by the management for Six Sigma Black Belt training, and upon completion of the course, became the local expert in data-driven process improvement.

I was involved in a number of DMAIC ("Define, Measure, Analyze, Improve, Control" - the Six Sigma mantra, aka GEMS - "Gather, Examine, Modify, Sustain" - the GE Medical Systems' Six Sigma mantra) projects, aimed at improvement of the consistency and quality of the critical to quality (CTQ) process parameters driving the bottom-line.

In that capacity, I realized two very important things:

1. That it is impossible to be successful in data-driven process improvement without using specialized software for data analytics.

2. That it is very expensive and silly to do experiments on live production lines. The cost of this may offset the benefit of the knowledge and control over the CTQ parameters, driving the ROI into the ground.

These two facts compelled me to ...

(To read on, click the Comments link)