Tuesday, May 25, 2010

More on Monte-Carlo simulation: if we don't know the equation

In my previous post, I was discussing the beauty of using Monte-Carlo simulation when we can describe the world by a formula.  Sadly, more often than not, we can't.  The beauty of Monte-Carlo analysis is in its suitability for such cases.  If we can describe the world as a set of nodes and links (aka states and events), if we know the probability distribution of the states, or the timing and magnitude of the events, and if we want to know the most likely outcome, we do not have to use an analytical model: the node-and-link grid has all the solutions in it.  Some of these solutions are, again, expressed as an equation or a set of simultaneous equations; others, as a parsing algorithm (akin to the parsing algorithm for an N-ary tree).

Any node-and-link model involves activities (akin to decision making) on nodes, connected by links.  In some cases, such links are physical; e.g., in traffic simulation, highways are represented by links, and nodes correspond to intersections, stores, towns, etc.  Traffic density, delays, link traversal times, and link and node capacities are the independent variables for the "provider" nodes and links and the dependent variables for the "receiver" nodes and links.

In other cases, links are logical, belonging to the "IF...THEN...ELSE" category.  Probability distributions of the decisions made at each node are the metrics of interest in this case, along with other parameters like retention time at the node, link delay, etc.


A Closer Look at the Node-and-Link Model Structure from the Monte-Carlo Standpoint.

A fragment of a traffic network represented by nodes and links is shown in Figure 1 (left).  "Vehicles" (anything from cars to messages) traverse the nodes via links, each node and link having its intrinsic properties, like retention time, capacity, throughput, reliability, etc.

[Figure 1: a fragment of a traffic network (left) and the corresponding decision grid with retention-time and link-selection histograms (right)]
A traffic network model can just as well simulate a decision grid by introducing a virtual "crawler" that parses the network.  By sending multiple crawlers down the network and simulating each crawler's retention times and decisions at each node, we build up the total distribution of the network latency.

The diagram on the right illustrates the concept.  Nodes are decision-making points where the crawlers are detained for time periods described by known distributions (see the histograms).  At decision nodes (e.g., N2_2 in this diagram), each outgoing link is associated with a given frequency of selection (the binary histogram next to it).
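To make this concrete, here is a minimal Monte-Carlo sketch in Python; the node names, retention-time distributions, and link-selection frequencies below are illustrative stand-ins, not values read off Figure 1:

import random

# Hypothetical node retention times (mean, std, in seconds) and link-selection
# frequencies at the decision node; all numbers are made up for illustration.
RETENTION = {"N1": (2.0, 0.5), "N2_2": (4.0, 1.0), "N3a": (1.5, 0.3), "N3b": (3.0, 0.8)}
BRANCH_N2_2 = [("N3a", 0.7), ("N3b", 0.3)]

def one_crawler():
    """Send one crawler down the network fragment and return its total latency."""
    latency = 0.0
    for node in ("N1", "N2_2"):
        mean, std = RETENTION[node]
        latency += max(0.0, random.gauss(mean, std))
    # At the decision node, pick an outgoing link with its observed frequency.
    nxt = random.choices([n for n, _ in BRANCH_N2_2],
                         weights=[f for _, f in BRANCH_N2_2])[0]
    mean, std = RETENTION[nxt]
    latency += max(0.0, random.gauss(mean, std))
    return latency

# Many crawlers build up the network-latency distribution.
latencies = sorted(one_crawler() for _ in range(10_000))
print("median simulated latency:", round(latencies[len(latencies) // 2], 2))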


The beauty of Monte-Carlo simulation is that it can do fast-time modeling, simulating time faster than it passes in the model's real-world counterpart.

Simulation time can be handled in a number of ways, the most popular of which are:

1.  Real-time: when running human-in-the-loop simulations, with actual humans making some of the decisions, as well as in psychology and social-science research, real-time Monte-Carlo simulation is irreplaceable, as it allows subjects to make decisions at the actual tempo of the world they are used to.  It is also very popular in the game business, with SimCity and Earthquake being among the earliest adopters of this approach, where the user is in control of the simulated world.

Interactive online games are the next step in the evolution of the real-time Monte-Carlo approach: users are in control, and all the hundreds (thousands?) of users, by virtue of playing the game, become the simulation.  The downside of using an interactive-game environment in any research is that the number of users varies, which will likely make any data collected from such a setup too noisy to be useful.

2.  Time-scale: the simulated time delays are actually waited out, on a scale, by the modeling computer.  This method allows building very high-fidelity event-and-process models; however, it may not be very efficient for modeling situations that take a long time to complete without much happening; e.g., a simulation of a day of operations of a police station in a quiet neighborhood of Beverly Hills, CA would be too boring to analyze, and it would take far too long for any analyst's liking, even at a scale of 1:24, which allows a day of simulated time to pass in one hour of the simulation's runtime.  Another drawback of time-scale simulations is their dependency on the simulator running continuously: a stoppage of the simulation software or hardware will cause a disconnect between simulated time and real time, which time-scale modeling developers need to be aware of and account for in their disaster-recovery and exception-handling practices.

3.  Fast-time: this is yet another change of paradigm with regard to what the nodes and links represent.  (See the illustration below.)


Each crawler (aka agent) knows the timestamp of its next event.  The event may be any change of the crawler's current state (spatial location, compass heading, number of friends on Facebook, job status, citizenship, oil pressure, speed, funding status, etc.).

A master controller keeps the simulation time and a collection of agents sorted by next-event time.  All agents act independently or interact with each other via the rules defined for them in the simulation program or configuration files.  Each agent exposes only the functionality and fields that other agents need to be aware of.  In addition, each agent may keep a list of agents that need to be updated along with it, in an Observer-style design pattern.
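A bare-bones sketch of such a master controller might look like the following; the Crawler class, its exponential retention times, and the single event it handles are hypothetical placeholders for whatever agents a real model would define:

import heapq
import random

class Crawler:
    def __init__(self, name, mean_delay):
        self.name = name
        self.mean_delay = mean_delay
        self.next_event_time = 0.0

    def schedule_next(self, now):
        # Retention time drawn from an assumed exponential distribution.
        self.next_event_time = now + random.expovariate(1.0 / self.mean_delay)

    def handle_event(self, now):
        # A real agent would update its state and notify its observers here.
        print(f"t={now:7.2f}  {self.name} finished a node")
        self.schedule_next(now)

def run(agents, until):
    clock, queue = 0.0, []
    for agent in agents:
        agent.schedule_next(clock)
        heapq.heappush(queue, (agent.next_event_time, id(agent), agent))
    while queue:
        event_time, _, agent = heapq.heappop(queue)
        if event_time > until:
            break
        clock = event_time                      # jump straight to the next event
        agent.handle_event(clock)
        heapq.heappush(queue, (agent.next_event_time, id(agent), agent))

run([Crawler("crawler-1", 5.0), Crawler("crawler-2", 3.0)], until=20.0)

The heap keeps the agents ordered by next-event time, so the clock only ever jumps from one event timestamp to the next, which is what makes the run faster than real time.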

The inherent flexibility of this simulation, as well as its ability to run faster than real time while at the same time making more productive use of the hardware than time-scale or real-time simulation does, makes it the key to efficient discrete-event simulation and the paradigm of choice for many analyses.

Saturday, May 22, 2010

More on Monte-Carlo simulation: If we know the equation.

There are lots of examples of using Monte-Carlo analysis in situations when we have the relationship between the independent variables (factors) and the dependent variable(s).

For example, let's consider a performance-risk-analysis problem for a given multiprocessor system, and suppose we know that this system is described by the universal scalability model derived by Dr. Neil Gunther.  (The latter is a very reasonable assumption, considering that the USM is known to be perhaps the most valid representation of multi-processor capacity.)  Suppose, further, that we have been able to get enough throughput measurements to derive the parameters (α and β) for this system's model:

C(p) = p / (1 + α (p − 1) + β p (p − 1))          (1)

(Neil, thank you for the HTML!)

In this formula, C(p) is the throughput, or capacity, of the system; α is the contention component, and β is the coherency component of the system in question; p is the number of processors.

We are asked to predict whether the system will be able to handle the workload given a probability distribution of p, the number of processors operational on this system at any given moment in time (they may just as well be switches, routers, controllers, etc.).

In mathematical terms, we are asked to determine the distribution of the system's throughput (capacity) if the number of processors follows a certain probability distribution.


(This is inserted on July 5, 2002:) In addition, when we calculate the model parameters α and β, we can only do so with some degree of uncertainty.  That implies a distribution of possible values for the parameters, with the calculated numbers being, hopefully, the mean and mode of those distributions.  In the best-case scenario, we can assume a normal distribution with a standard deviation of zero, in which case the only probabilistic term is the number of processors in operation.
More in a blog due next week (hopefully)
(End of July 5, 2002 insertion)

We have the formula and the distribution for p; we vary p by rolling the dice from this distribution and calculate the resulting values of the system's capacity, C(p). The more times we roll the dice, the more closely the resulting data will resemble the distribution of the capacity of this system.
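As a sketch of that procedure, assuming illustrative values of α and β, a binomial distribution for the number of operational processors, and a made-up capacity requirement (none of these numbers come from a real system):

import random

ALPHA, BETA = 0.02, 0.0005            # hypothetical fitted parameters
P_INSTALLED, P_UP = 32, 0.90          # 32 processors, each up with probability 0.90
REQUIRED_CAPACITY = 14.0              # hypothetical demand, in single-processor units

def capacity(p):
    # Equation (1): the universal scalability model.
    return p / (1.0 + ALPHA * (p - 1) + BETA * p * (p - 1))

def draw_p():
    # "Roll the dice" for the number of processors operational at this moment.
    return sum(random.random() < P_UP for _ in range(P_INSTALLED))

samples = [capacity(draw_p()) for _ in range(100_000)]
shortfall = sum(c < REQUIRED_CAPACITY for c in samples) / len(samples)
print("estimated P(capacity below demand):", shortfall)

The list of C(p) samples is the empirical capacity distribution; the shortfall fraction is one way of summarizing the performance risk.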

Parenthetically: generally speaking, when applying Monte-Carlo or any other analytical technique, care must be taken that the distribution of p stays reasonably within the range over which the parameters were determined. In the case of the USM, however, this comment may be considered of low relevance.

We then run statistical tests to determine whether the C(p) that we calculated will meet the demands for the system's capacity.

What tests to run, and how to interpret the results, is the subject of another post.

Thursday, May 20, 2010

Discrete-event simulation and capacity planning

When we have historical time-series data and want to plan the system's capacity (and it doesn't matter whether we are talking about a storage, network, airspace, or classroom system; any non-isolated system that needs a way to store data, signals, students, airplanes, etc.), we know what to do: see my previous post.

In this case, we analyze the data, generate forecasts, and compare the forecasts against the installed capacity to determine the forecasted system resource utilization, which leads us to identifying the capacity needs of the future.

Failure Analysis

There is, however, a class of problems where we want to predict the system's capability to perform without error, but errors are rare events that have no right to happen. Traffic safety comes to mind as an example of a problem of this class. We may observe such a system till we turn blue in the face and notice no failures. Does that mean the system is safe?

The Nature of System Failures

By failure, I refer to any case of the system not being adequate to the task or not performing as specified. It can be a defect rejected from a production line, a gridlock on a network, or a crash. A process parameter outside the control limits, too, can be treated as a failure.

Failures are (hopefully) rare events distributed according to the Poisson distribution, which describes the frequency of rare events in a given time period, group of entities, etc. For these, time-series analysis is not the right tool; discrete-event simulation is more adequate to the task.
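A one-line Poisson calculation shows why observing no failures proves little. Assuming, purely for illustration, a failure rate of one per 1,000 hours, a 200-hour watch will see zero failures about 82% of the time:

import math

rate_per_hour = 1 / 1000        # assumed failure rate: one failure per 1,000 hours
window_hours = 200              # length of the observation window
lam = rate_per_hour * window_hours
p_zero = math.exp(-lam)         # Poisson probability of observing zero failures
print(f"P(no failures observed) = {p_zero:.1%}")   # roughly 82%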

About Discrete-Event Simulation:


Any discrete-event simulation (aka DES) operates on the principle of abrupt transitions from state to state. An event is merely another name for such a transition. It is characterized by a timestamp, the previous state, and the new state.


When the simulation has processed the System State Before Event (I will call it SSBE here), it immediately advances the simulation clock to the timestamp of the System State After Event (SSAE). Of course, in the real world the real-time clock keeps ticking, and there is some time-scaling going on, as opposed to a truly discrete transition; but for the purposes of the simulation this approach nevertheless allows us to model real-world processes faster than they really happen. That is why the other name for discrete-event simulation is fast-time simulation.

Validation and Calibration:
However, it is critically important that the model be properly validated against actual historical data, for reasons I am going to talk about later.

Once a model is built for the simulation, it is run in the Monte-Carlo manner, repeating the runs multiple times and analyzing the outcomes for deviation from the historical data. A regular t-test (or z-test) can be used for each of the critical parameters of the model. If any parameter fails to validate, the model may have to be modified, and then the entire cycle will have to be done again.
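A minimal sketch of that comparison, using SciPy's two-sample t-test on made-up historical and simulated values of one critical parameter (say, mean transaction latency):

from scipy import stats

# Illustrative numbers only: one critical parameter, history vs. simulation runs.
historical = [102.0, 98.5, 101.2, 99.8, 100.4, 103.1, 97.9, 100.0]
simulated  = [100.9, 99.2, 101.8, 98.7, 100.5, 102.2, 99.5, 101.1]

t_stat, p_value = stats.ttest_ind(historical, simulated, equal_var=False)
if p_value < 0.05:
    print("Significant difference: revise the model and re-validate.")
else:
    print("No significant difference: this parameter passes validation.")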

This is the basic validation process.

Its outcome - a simulation properly validated - can be called benchmarked for the base use case.

In the ideal world, we have data for multiple use cases. If so, we can not only validate the model, but also calibrate it against a range of conditions.


Monte-Carlo and failure analysis

Now, if we have ranges and distributions for the independent variables, we can vary them randomly ("rolling the dice") and run the simulation, obtaining the dependent variables. This process is called - surprise! - Monte-Carlo simulation. Assuming that the data we validated the model against adequately represent the system, we can then, based on the outcome of the model, predict how the real system is likely to behave. All system failures, no matter how unlikely, will inevitably appear given a sufficiently large number of iterations.
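As a sketch of that "rolling the dice" loop, with made-up distributions for the independent variables and a deliberately simplistic stand-in for the model:

import random

def run_model(arrival_rate, service_rate):
    # Stand-in for a full discrete-event model: "failure" = offered load >= 1.
    return arrival_rate / service_rate >= 1.0

N = 200_000
failures = 0
for _ in range(N):
    arrival_rate = random.lognormvariate(0.0, 0.25)   # e.g., jobs per second
    service_rate = random.gauss(1.3, 0.15)            # e.g., jobs per second
    failures += run_model(arrival_rate, service_rate)

print("estimated failure probability:", failures / N)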


Monte-Carlo and Robust Experimentation
If we have been able to calibrate the model, too, then we can run all kinds of robust experiment designs, covering the range of independent-variable values that corresponds to the two conditions at which the model was calibrated. The beauty of it is that the experiments are conducted on the model, without the need to take the real system out of operation for the duration of the experiments.
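One sketch of what such an experiment could look like: a two-level full-factorial design run directly on the model, with hypothetical factor names, levels, and a stub in place of the calibrated simulation:

from itertools import product

FACTORS = {
    "arrival_rate": (0.8, 1.2),   # low / high ends of the calibrated range
    "buffer_size":  (16, 64),
    "retry_limit":  (1, 3),
}

def model(arrival_rate, buffer_size, retry_limit):
    # Stand-in for the validated and calibrated simulation; returns a response.
    return buffer_size / arrival_rate - 2.0 * retry_limit

names = list(FACTORS)
for levels in product(*(FACTORS[n] for n in names)):
    settings = dict(zip(names, levels))
    print(settings, "->", round(model(**settings), 2))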

More on design of experiment, robust design, etc. is coming in one of my next posts.

Wednesday, May 19, 2010

About capacity planning

On March 31, ComputerWorld ran an article "How to develop an effective capacity planning process"

(http://www.computerworld.com/s/article/9174422/How_to_develop_an_effective_capacity_planning_process)
Excerpt:
* 1. Select an appropriate capacity planning process owner.
* 2. Identify the key resources to be measured.
* 3. Measure the utilizations or performance of the resources.
* 4. Compare utilizations to maximum capacities.
* 5. Collect workload forecasts from developers and users.
* 6. Transform workload forecasts into IT resource requirements.
* 7. Map requirements onto existing utilizations.
* 8. Predict when the shop will be out of capacity.
* 9. Update forecasts and utilizations.
"

It's all very true, but only in an ideal world. In particular, Item 5 is not very likely to happen in the real world, for a number of reasons.

First, workload forecasts are normally derived from the overall amount of work in the corporation - private information usually kept close to the chest by the corporate business wizards, since knowledge of such info may be classified as insider information.

Second, developers and other IT users will at best have an understanding of what such information is being collected for, but it sits on the farthest back burner of their minds and will typically be forgotten right after you talk to them about it: they have more pressing issues to worry about, and developers want to optimize the resources they already have, anyway.

Third, the transformation from workload predictions to capacity utilization is an ill-defined problem, the only honest answer to which is "it depends".

Even such a seemingly trivial problem as finding the correlation between corporate headcount and the utilization of Exchange server capacity (storage, performance, or both) will never follow a single formula, but will depend on a number of conditions, from Exchange server configuration and allocation to corporate policies, etc.

When the company is using a SAN cloud, MapReduce in any of its incarnations, or any other distributed system, such transformation of developer workload forecasts into corporate IT utilization speculations very quickly degenerates into an exercise in Professor McGonagall's art of transfiguration and results in what Scott Armstrong called, "Bold Freehand Extrapolation".

Such forecasts cannot really be used to do serious data-driven capacity planning.

The right approach is to collect the actual usage data from the IT network and, once a sufficient amount of data has been collected, to calculate a Time-Series-Analysis (TSA) forecast for each piece of hardware for which usage data have been collected.
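As a sketch of that step, here is Holt's linear-trend exponential smoothing (one of many TSA methods one could use) applied to a made-up monthly utilization history for a single server:

def holt_forecast(series, alpha=0.5, beta=0.3, horizon=6):
    # Holt's linear-trend exponential smoothing, fitted recursively.
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(horizon)]

# Monthly CPU utilization (%) of one server, as collected from the IT network.
history = [38, 40, 41, 44, 47, 49, 52, 55, 57, 61, 63, 66]
print([round(x, 1) for x in holt_forecast(history)])
# Each forecast value is then compared against the installed capacity
# (say, an 80% utilization ceiling) to predict when headroom runs out.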

This approach is characterized by 5 very important features:

1. It reduces the guesswork to the bare minimum and makes the process nearly non-arbitrary: forecasts become data-driven.
2. Assuming the right tools have been used for forecasting, it produces accurate and reliable forecasts.
3. It makes its user aware of the uncertainty inherent in forecasting: when we say that we will use X amount of resource, we choose to ignore the uncertainty; when we use a statistical tool to come up with the X, the uncertainty comes up, and we have to deal with it - which is an inconvenience, but a good thing: it keeps us and our partners honest.
4. It gives us a way to judge the quality of the forecast.
5. It lends itself easily to multi-level forecasting, allowing us to build up forecasts for higher levels of integration from the lower levels of granularity.

All of this together is a very convincing reason why we should not rely on users' and developers' information in capacity planning, but instead use real forecasts.

Monday, May 17, 2010

How it all started: Part 1 / 4 (for the remaining 3 parts, click Comments)

Allow me to introduce myself: AlexG. I am originally from Moscow, Russia, and have been living in California (the Bay Area) for the last 19 years. I have a family, kids, and a great interest in life and the many different things it offers.

I received an excellent graduate-level education in Applied Math, Control System Engineering, and Operations Research, among other things. This education carried me along to a professional engineer (PE) license and into Semiconductor Manufacturing, Six Sigma, and real Data Analytics, creating software for Data Mining, Fast-Time Simulations, and Statistical Forecasting.

Along the way, I generated papers and patent applications in every domain I touched, from waste treatment to data mining.

I joined EG&G Amorphous Silicon (which is now Perkin-Elmer Optoelectronics) in 1995 to be a Process Engineer in a semiconductor production startup environment. A couple of years later, I was picked by the management for Six Sigma Black Belt training, and upon completion of the course, became the local expert in data-driven process improvement.

I was involved in a number of DMAIC ("Define, Measure, Analyze, Improve, Control" - the Six Sigma mantra, aka GEMS - "Gather, Examine, Modify, Sustain" - the GE Medical Systems' Six Sigma mantra) projects, aimed at improving the consistency and quality of the critical-to-quality (CTQ) process parameters driving the bottom line.

In that capacity, I realized two very important things:

1. That it is impossible to be successful in data-driven process improvement without using specialized software for data analytics.

2. That it is very expensive and silly to do experiments on live production lines. The cost of this may offset the benefit of the knowledge and control over the CTQ parameters, driving the ROI into the ground.

These two facts compelled me to ...

(To read on, click the Comments link)