Tuesday, July 6, 2010

More on Monte-Carlo simulations and scalability

In one of my earlier posts (May 22, 2010), I talked about using Monte-Carlo simulation together with the Universal Scalability Law to evaluate system capacity. This post is an extension of that train of thought.

A very important and convenient feature of the USL model is that its parameters α and β can be evaluated by running a controlled experiment: passing the same workload through the system with different numbers of parallel processors and measuring how long the system takes to process it.
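For reference, the USL expresses relative capacity as a function of the number of processors p: C(p) = p / (1 + α(p − 1) + β·p(p − 1)), where α is the contention term and β the coherency term. A minimal R rendering of the formula (the helper name usl and the parameter values are mine, for illustration only) could be:

    # Universal Scalability Law: relative capacity as a function of the
    # number of processors p, contention (alpha) and coherency (beta)
    usl <- function(p, alpha, beta) {
      p / (1 + alpha * (p - 1) + beta * p * (p - 1))
    }

    # Example: capacity at 4 to 40 processors for some illustrative parameters
    usl(seq(4, 40, by = 4), alpha = 0.03, beta = 0.0005)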


(Disclaimer: All charts, diagrams, and values in this post are not based on actual collected data and are provided for illustrative purposes only.  All the data were generated and processed using the R programming language)

When we run a controlled experiment, we have a fixed number of processors and a fixed set of tasks, and we measure the throughput. Everything is fixed, and we can quickly calculate the parameters.


However, in production we may not have that luxury, and this definitely limits our ability to obtain α and β as deterministic numbers.


We measure whatever we can get from the production system, and as a result we obtain a distribution of possible αs and a distribution of possible βs for the Universal Scalability Law model.


In this case, we can measure the throughput of a group of processors (a server, or a server pool) over time, recording both the throughput and the number of processors that were in operation when each measurement was taken.


Then, from these data, we can estimate the α and β parameters using, e.g., the multivariate least-squares method (modified to fit the formula), maximum-likelihood estimation, or any other suitable approach. These statistical methods allow us to estimate the significance of the model parameters and, more importantly, their distributions.
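As a rough sketch of what such an estimation might look like in R (the data frame df, its column names, and the starting values are assumptions for illustration), one could fit the USL formula with nls() and read the estimates, their standard errors, and their covariance from the fit:

    # df is assumed to hold production measurements:
    #   df$p - number of processors in operation at the time of measurement
    #   df$x - measured throughput (relative to the single-processor rate)
    fit <- nls(x ~ p / (1 + alpha * (p - 1) + beta * p * (p - 1)),
               data  = df,
               start = list(alpha = 0.05, beta = 0.001))

    summary(fit)   # estimates, standard errors, significance
    coef(fit)      # point estimates of alpha and beta
    vcov(fit)      # covariance matrix, usable for sampling the parameters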




Once we have these parameters (as probabilistic values) and the number of processors (as a random value with a, possibly uniform, distribution), we can run the Monte-Carlo simulation and identify the likeliest values of the system throughput, as well as potential pitfalls.
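A minimal sketch of such a simulation, reusing the usl() helper above and assuming, purely for illustration, normal distributions for α and β and a uniform distribution of the processor count over the 4-40 range:

    set.seed(42)
    n <- 100000   # number of Monte-Carlo trials

    # Assumed (illustrative) parameter distributions
    alpha <- rnorm(n, mean = 0.03,   sd = 0.005)
    beta  <- rnorm(n, mean = 0.0005, sd = 0.0001)

    # Uniform distribution of the number of processors in operation
    p <- sample(4:40, n, replace = TRUE)

    # Throughput for each simulated combination
    x <- usl(p, alpha, beta)

    hist(x, breaks = 50, main = "Simulated throughput", xlab = "Throughput")
    quantile(x, c(0.05, 0.5, 0.95))   # likeliest values and the tails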


Generally, a regression cannot be extended beyond the range over which it was built. But the universal part of the Universal Scalability Law implies that it can be used with the same parameters for systems whose size and complexity extend beyond the system on which these parameters were calculated.


Thus, keeping the same α and β distributions and running a Monte-Carlo simulation with the number of processors P extended beyond the original 4-40 range to a range of 32-400 gives the following USL curve:
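In the simulation sketch above, that extrapolation amounts to nothing more than widening the range the processor count is drawn from (the α and β samples stay the same):

    # Same alpha and beta samples as before; only the processor range changes
    p_ext <- sample(32:400, n, replace = TRUE)
    x_ext <- usl(p_ext, alpha, beta)

    plot(p_ext, x_ext, pch = ".", xlab = "Processors", ylab = "Throughput")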


By running the Monte-Carlo simulation, we can also get the distribution of the throughput. In the particular example considered here, the distribution is strongly skewed to the right (the higher end of the throughputs).

We also see that the throughput will have a range of achievable values, which is determined by the α and β distributions calculated above, as well as the distribution of the number of processors running simultaneously on the system under consideration.

It is interesting to look at the throughput curves under various distributions assumed for the number of processors running. The plots above assumed a uniform distribution, meaning that there is a range of numbers of processors and, at any time, each of these numbers has an equal chance of being operational; in other words, the load balancers are working properly.


Alternatively, we may consider a binomial model, wherein a given fraction of the pool of processors is operational at any time. It is a discrete equivalent of the normal distribution, and it is intuitively obvious that the system will behave differently in this model.


Monte-Carlo simulation shows that, indeed, for a binomial distribution of the number of processors running in the group, USL describes a very different behavior:





Here we have 400 processors, of which 25% (100 processors) are operational at any time. All other parameters are the same as in the uniform case analyzed above.
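In terms of the sketch above, this case corresponds to drawing the number of operational processors from a binomial distribution with size 400 and probability 0.25, so that about 100 of them are running at any time:

    # Binomial model: 400 processors in the pool, each operational
    # with probability 0.25, so ~100 are running at any time
    p_bin <- rbinom(n, size = 400, prob = 0.25)
    x_bin <- usl(p_bin, alpha, beta)

    hist(x_bin, breaks = 50,
         main = "Throughput, Binomial(400, 0.25) processors",
         xlab = "Throughput")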


Interestingly enough, replacing that with 110 processors, of which 90% are operational at any time (again roughly the same 100 processors), gives yet another, different behavior:
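The reason for the difference is visible in the spread of the two processor-count distributions: both average roughly 100 operational processors, but Binomial(400, 0.25) has a variance of 75 while Binomial(110, 0.9) has a variance of only about 9.9, so the second pool spends nearly all of its time within a few processors of 99. In the sketch used here:

    # Alternative pool: 110 processors, 90% operational (~99 running on average)
    p_bin2 <- rbinom(n, size = 110, prob = 0.9)
    x_bin2 <- usl(p_bin2, alpha, beta)

    c(var(p_bin), var(p_bin2))   # ~75 vs ~9.9: a much tighter processor count
    hist(x_bin2, breaks = 50,
         main = "Throughput, Binomial(110, 0.9) processors",
         xlab = "Throughput")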


It is important to understand that, even if the observed data do not appear to support the Universal Scalability Law model, it only means that the distribution of the number of servers in operation is non-uniform: these pictures were obtained using the USL equation itself.


These last two plots do not mean that it is especially difficult to obtain the USL model parameters from data gathered under a non-uniform distribution of the number of servers in operation: we assume that the uncertainty in the α and β values comes from our not knowing their exact values, not from any inherent indeterminacy of the contention and coherency terms.


In the case of a binomial distribution of the parallel servers running on the task, we get a nearly normal distribution of throughput (compare it with the earlier throughput histogram for the uniform model of the running processors).


It seems reasonable to theorize that the shape of the throughput distribution of a multi-processor system is determined by the way the system is operated, and that by looking at the historical histogram of a system's throughput we can determine the model describing the system's load-balancing operation.

As a side note, if the α and β distributions are not normal, the throughput distribution will have a different shape as well.