Thursday, May 20, 2010

Discrete-event simulation and capacity planning

When we have historical time-series data and want to plan a system's capacity, we know what to do: see my previous post. It does not matter whether we are talking about a storage, network, airspace, or classroom system; the approach applies to any non-isolated system that needs a way to hold data, signals, students, airplanes, and so on.

In that case, we analyze the data, generate forecasts, and compare the forecasts against installed capacity to determine the forecasted resource utilization, which leads us to the capacity needs of the future.

Failure Analysis

There is, however, a class of problems where we want to predict the system's capability to perform without error, but errors are rare events that have no right to happen. Traffic safety comes to mind as an example of this class. We may observe such a system till we turn blue in the face and notice no failures. Does that mean the system is safe?

The Nature of System Failures

By failure, I mean any instance of the system not being adequate to its task or not performing as specified. It can be a defect rejected from a production line, a gridlock on a network, or a crash. A process parameter drifting outside its control limits can be treated as a failure, too.

Failures are (hopefully) rare events, distributed according to the Poisson distribution, which describes the frequency of rare events in a given time period, group of entities, and so on. For such events, time-series analysis is not the right tool; discrete-event simulation is more adequate to the task.
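This is also why observing no failures proves little. A quick sketch of the Poisson math (the numbers below are hypothetical, just for illustration):

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k events when the mean rate is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Hypothetical example: a system whose true mean is 0.5 failures per
# observation window still shows zero failures about 61% of the time,
# so "we watched and saw nothing" is weak evidence of safety.
p_zero = poisson_pmf(0, 0.5)
print(f"P(no failures observed) = {p_zero:.3f}")
```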

About Discrete-Event Simulation:


Any discrete-event simulation (aka DES) operates on the principle of abrupt transitions from state to state. An event is merely another name for such a transition; it is characterized by a timestamp, the previous state, and the new state.


Once the simulation has processed the System State Before Event (I will call it SSBE here), it immediately advances the simulation clock to the timestamp of the next event, producing the System State After Event (SSAE). In the real world, of course, the real-time clock keeps ticking between events, rather than jumping discretely; the simulation simply skips that idle time. This is what allows it to model real-world processes faster than they actually happen, which is why discrete-event simulation is also called Fast-Time simulation.
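The clock-jumping mechanism above can be sketched as a small event loop. This is a minimal toy, not any particular DES package; the event names and timings are made up for illustration:

```python
import heapq

def run_simulation(initial_events, handlers, until=100.0):
    """Minimal discrete-event loop: the clock jumps straight from one
    event timestamp to the next, skipping the idle time in between."""
    queue = list(initial_events)        # (timestamp, event_name) pairs
    heapq.heapify(queue)
    log = []
    while queue:
        timestamp, name = heapq.heappop(queue)
        if timestamp > until:
            break
        # SSBE -> SSAE: the clock simply becomes the event's timestamp.
        log.append((timestamp, name))
        for new_event in handlers.get(name, lambda t: [])(timestamp):
            heapq.heappush(queue, new_event)
    return log

# Hypothetical toy system: an arrival every 3 time units, each arrival
# scheduling a departure 1 unit later.
handlers = {
    "arrival": lambda t: [(t + 3.0, "arrival"), (t + 1.0, "departure")],
}
trace = run_simulation([(0.0, "arrival")], handlers, until=10.0)
print(trace)
```

Ten simulated time units are processed in however long the loop takes to run, which is the "fast-time" property.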

Validation and Calibration:
However, it is critically important that the model be properly validated against actual historical data, for reasons I am going to talk about later.

Once a model is built for the simulation, it is run in the Monte-Carlo manner: repeating the runs many times and checking the outcomes for deviation from the historical data. A regular t-test (or z-test, for large samples) can be applied to each critical parameter of the model. If any parameter fails validation, the model may have to be modified, and then the entire cycle repeated.
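A sketch of such a check, using a two-sided z-test on a batch of simulated run outcomes against a historical mean (all numbers and the 4.0-minute "historical mean" are hypothetical):

```python
import math
import random
import statistics

def z_test(sample, historical_mean, z_crit=1.96):
    """Two-sided z-test at alpha = 0.05: True means the simulated mean
    is statistically indistinguishable from the historical mean."""
    n = len(sample)
    stderr = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.fmean(sample) - historical_mean) / stderr
    return abs(z) < z_crit

# Hypothetical check: 200 Monte-Carlo runs of a queue model produced
# these mean waiting times; the historical mean is 4.0 minutes.
random.seed(42)
runs = [random.gauss(4.0, 0.5) for _ in range(200)]
print("parameter validated:", z_test(runs, 4.0))
```

If the test fails for any critical parameter, the model is adjusted and the whole batch of runs is repeated, exactly as described above.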

This is the basic validation process.

Its outcome, a properly validated simulation, can be called benchmarked for the base use case.

In an ideal world, we have data for multiple use cases. If so, we can not only validate the model but also calibrate it across a range of conditions.


Monte-Carlo and Failure Analysis

Now, if we have ranges and distributions for the independent variables, we can vary them randomly ("rolling the dice") and run the simulation to obtain the dependent variables. This process is called (surprise!) Monte-Carlo simulation. Assuming the data we validated the model with adequately represent the system, we can then predict, from the model's outcomes, how the real system is likely to behave. Every failure mode the model can produce, no matter how unlikely, will eventually appear given a sufficiently large number of iterations.
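The whole loop fits in a few lines. Here is a hypothetical sketch: the single-run model, its demand distribution, and the capacity threshold are all invented for illustration, standing in for a validated simulation:

```python
import random

def simulate_once(rng, capacity=10):
    """Hypothetical one-run model: random demand against a fixed
    capacity; the run 'fails' if demand exceeds capacity."""
    demand = sum(rng.random() < 0.2 for _ in range(40))  # ~Binomial(40, 0.2)
    return demand > capacity

def failure_probability(iterations=100_000, seed=1):
    """Roll the dice many times and count how often the model fails."""
    rng = random.Random(seed)
    failures = sum(simulate_once(rng) for _ in range(iterations))
    return failures / iterations

print(f"estimated P(failure) = {failure_probability():.4f}")
```

The rarer the failure, the more iterations are needed before the estimate stabilizes, which is why these runs are repeated in bulk.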


Monte-Carlo and Robust Experimentation
If we have also been able to calibrate the model, then we can run all kinds of robust experiment designs, covering the range of independent-variable values bounded by the conditions at which the model was calibrated. The beauty of it is that the experiments are conducted on the model, without the need to take the real system out of operation for the duration of the experiments.
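As a taste of what such a designed experiment looks like, here is a two-level full-factorial design over three made-up model inputs, each varied between hypothetical low and high values of the kind calibration would supply:

```python
from itertools import product

# Hypothetical factors and levels; in practice the low/high values come
# from the conditions at which the model was calibrated.
factors = {
    "arrival_rate": (5.0, 15.0),   # events per hour
    "service_time": (2.0, 6.0),    # minutes
    "num_servers":  (1, 3),
}

# Every combination of low/high levels: 2^3 = 8 simulation runs.
design = [dict(zip(factors, combo)) for combo in product(*factors.values())]
for run in design:
    print(run)   # each row is one set of inputs to feed the simulation
```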

More on design of experiment, robust design, etc. is coming in one of my next posts.
