Sunday, April 19, 2015

Discovering Patterns in Irregular Behavior: Part 2

The mathematical description of irregular behavior is the holy grail of statistical analysis, akin to the game of croquet that Alice reluctantly played with the Queen of Hearts.  (If you have not read or watched Lewis Carroll’s classic, please refer to a brief description of the game here.)  A lot of uncertainty and ambiguity, just like in the world of data.  Could Alice have won the game?

About Previous Post

In the previous post, we set the scene and explored the history of the problem.  If you want to read about it from the sage who actually made the history of modern data science, you should get offline and read the book “An Accidental Statistician” by George Edward Pelham Box - the father of modern statistics.  Sadly, he passed away in 2013, at 93 years of age.  He was one of the titans of statistics, responsible for design of experiments, statistical quality control, time-series analysis, evolutionary operation, useful transformations, and the return of Bayesian methods into the mainstream of Data Science.  The true impact of his work will reverberate for many decades.

We also started a discussion of what Data Science really is and explained briefly how the two new approaches were enabled by technology to become the two prongs of the never-ending attack on data: Monte Carlo became the method of choice that eliminated the need for pigeonholing distributions, while Machine Learning became the tool that promised to eliminate the need for a human in the loop of data analysis, setting us free from the burden of mechanically crunching data and giving us time to think.

Enter Random Events: the Biggest Paradox of Data Science

When it comes to random events, we tend to take long, circuitous routes to bring them back into the fold of the Familiar.  Paradoxes tend to scare us.  We are like Alice: sharp enough to see the differences between how we think the world works and what we actually observe, but clueless about what to do with these discrepancies.


What Would Bohr Do?


We have a system whose behavior we think we know.  When we say “we know the behavior of a system”, what we really mean is that we can predict its behavior at any moment in time with a satisfactory degree of certainty.  The system is predictable, self-consistent, “sane”.  We know the distributions of the data and can determine the tolerances that put us within given confidence intervals.
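For instance, here is a minimal sketch of what “known distribution, known tolerances” looks like in practice (the Normal distribution and its parameters below are assumptions, not taken from the post):

```python
from scipy import stats

# Suppose (hypothetically) the system's output follows Normal(mean=50.0, sd=2.0).
mu, sigma = 50.0, 2.0

# Tolerances that contain 95% and 99% of the output values:
print(stats.norm.interval(0.95, loc=mu, scale=sigma))  # about (46.1, 53.9)
print(stats.norm.interval(0.99, loc=mu, scale=sigma))  # about (44.8, 55.2)
```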

But then - blame Murphy’s Law, or Gödel’s Incompleteness Theorems, or any number of conspiracy theories - life throws us a curveball, and the system stops behaving the way we know it to behave.  We run into a paradox: what we know about the system gets challenged, and our first reaction is to feel crushed, because few of us can say calmly, like Niels Bohr: “How wonderful that we have met with a paradox. Now we have some hope of making progress.”

The paradox in modeling is similar to Bohr’s legacy: if we fit a model perfectly to a system’s past behavior, its predictive ability will be lower than that of a model that does not fit 100% of the historical data.  The planetary model of the atom was not an oversimplification.  It was a generalization which, combined with an evolutionary (some would call it Bayesian) approach to modeling, allows us to accurately model the world.  As Box put it: “Essentially, all models are wrong, but some are useful.”
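To make that overfitting point concrete, here is a minimal sketch (the linear “law”, the noise level, and the polynomial degree are all made up) comparing a model that reproduces the historical points almost exactly with a simpler one that does not:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Hypothetical "history": a simple linear law plus noise
x_train = np.linspace(0, 10, 15)
y_train = 2.0 * x_train + 1.0 + rng.normal(0, 2.0, x_train.size)

# Future observations generated by the same process
x_test = np.linspace(10, 12, 20)
y_test = 2.0 * x_test + 1.0 + rng.normal(0, 2.0, x_test.size)

# A degree-12 polynomial reproduces the 15 historical points almost exactly;
# a straight line deliberately leaves some residual error.
overfit = Polynomial.fit(x_train, y_train, deg=12)
simple = Polynomial.fit(x_train, y_train, deg=1)

def rmse(model, x, y):
    return np.sqrt(np.mean((model(x) - y) ** 2))

print(f"history RMSE - overfit: {rmse(overfit, x_train, y_train):.2f}  simple: {rmse(simple, x_train, y_train):.2f}")
print(f"future  RMSE - overfit: {rmse(overfit, x_test, y_test):.2f}  simple: {rmse(simple, x_test, y_test):.2f}")
```

On this toy data, the high-degree polynomial wins on the history and loses badly on the future, which is exactly the paradox described above.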

What do Alice playing strange croquet, our tea-making R&D team, and any data analytics expert have in common?


They all find themselves in a situation where observations do not fit their model.  This is not an unusual situation, and our three members of the tea-maker design team did what every rational person would do under the circumstances.

Any change in initial conditions, environment, or observer's perception will have an effect on the parameters and even the structure of the model describing the scenario.

What do we do?
Imagine that you asked a team consisting of a physicist, a software engineer, and a mathematician to come up with an algorithm for making tea. There is little doubt that they will complete the task in no time:

  • Pour water into a kettle
  • Start heating the water in the kettle
  • While it is coming to a boil (100°C / 212°F), the operator has about 10 minutes to put the tea leaves (or a tea bag) into the cup(s).
  • When the water has reached just a degree or two below 100°C (212°F), pour it into the above-mentioned cup(s).
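(As an aside, these four steps can be sketched as a toy simulation; the Kettle class, the heating rate, and the temperatures below are all hypothetical.)

```python
BOILING_C = 100.0

class Kettle:
    """Toy kettle: heats simulated water by 10 °C per step (all numbers are made up)."""
    def __init__(self, water_temp_c=20.0):
        self.water_temp_c = water_temp_c

    def heat_step(self):
        self.water_temp_c = min(self.water_temp_c + 10.0, BOILING_C)

def make_tea(kettle, n_cups=1):
    # Put the leaves into the cups while the water is heating.
    cups = [{"leaves": True, "water_c": None} for _ in range(n_cups)]
    # Heat until the water is a degree or two below boiling.
    while kettle.water_temp_c < BOILING_C - 2:
        kettle.heat_step()
    # Pour the near-boiling water into the cups.
    for cup in cups:
        cup["water_c"] = kettle.water_temp_c
    return cups

print(make_tea(Kettle(water_temp_c=20.0), n_cups=2))
```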

And then you preheat the tap water to 97°C (206°F) in an industrial boiler and ask each of them to individually validate the algorithm.

The physicist will explain why it is impossible to have liquid water stored at its boiling point using nothing but household items, even after you show him that you have it stored.

The software engineer worth his salt will most likely see the performance improvement you have just made possible by preheating the water and will volunteer to rewrite the algorithm they all devised, making it more flexible while optimizing some of the code: if you have multiple cups of tea to prepare, you could set them all up in parallel with the leaves/bags, and then pour the water into all the cups concurrently...

Finally, the mathematician will absentmindedly pour the hot water into the kettle, start heating it, and put the teabag into the cup, using this time (the water cooled down somewhat while it was being poured) to contemplate the probability of such an event happening and what prior events could have led to it.

He will be genuinely surprised that this time the water reached the boiling temperature so fast and will likely call his friend the physicist as the subject-matter expert and discuss it with him.

Once they have agreed on a theory that the maximum-likelihood cause of this curious behavior was the significantly (both statistically and practically) higher initial temperature, they will invite the software engineer to a meeting.  He will arrive immediately, carrying three steaming cups of excellent tea made by applying his new algorithm to the preheated water.  They will then ask him to redesign the program to account for the bizarre event of the water’s initial temperature being higher than expected, and will show him the equations they want him to use in his tea program.

The next day, the abstract of a new publication by the three friends, describing the discovery of new properties of the kettle material, will be on your desk.

With Data, Nothing is Ever Deterministic
As chaos theory suggests, deterministic behavior can lead to pseudo-random behavior, but the Second Law of Thermodynamics ensures that the opposite is not true.
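The first half of that statement is easy to illustrate: the logistic map below is a fully deterministic recurrence, yet its output looks random, and two almost identical starting points quickly diverge (the parameter values are my own choice):

```python
# Logistic map: x_{n+1} = r * x_n * (1 - x_n).
# With r = 4 the sequence is fully deterministic yet looks random,
# and a perturbation in the sixth decimal place grows within a few steps.
def logistic_orbit(x0, r=4.0, n=20):
    xs = [x0]
    for _ in range(n):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_orbit(0.200000)
b = logistic_orbit(0.200001)
for step, (xa, xb) in enumerate(zip(a, b)):
    print(f"{step:2d}  {xa:.6f}  {xb:.6f}  diff={abs(xa - xb):.6f}")
```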

In practice, any measured data value can be described as a sum or a product of two components:

Additive Form

X_measured = X_deterministic + X_stochastic        (1)

Multiplicative Form

X_measured = X_deterministic × X_stochastic        (2)

Naturally, for positive values of X_measured, a simple log transformation can turn the multiplicative form into an additive one:

log(X_measured) = log(X_deterministic) + log(X_stochastic)        (3)

and we get an additive form by reassigning:

Y_measured = log(X_measured)
Y_deterministic = log(X_deterministic)
Y_stochastic = log(X_stochastic)

so that

Y_measured = Y_deterministic + Y_stochastic        (4)

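A tiny numerical check of Eq. (3) and Eq. (4) (the deterministic level and the lognormal noise below are made up):

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up multiplicative model: a positive deterministic level disturbed by multiplicative noise.
x_det = 50.0
x_stoch = rng.lognormal(mean=0.0, sigma=0.2, size=100_000)
x_measured = x_det * x_stoch

# After the log transform, the decomposition is additive, as in Eq. (3)/(4):
y_measured = np.log(x_measured)
print(np.allclose(y_measured, np.log(x_det) + np.log(x_stoch)))  # True
```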
If X_measured is zero or negative, the log function is not defined for such values, but we can “normalize” the variable (divide it by its range R), bringing it into the (0, 1] range, where the logarithm is defined.  Because such range normalization divides the value by the range, a more accurate form of Eq. (3) and Eq. (4) would be:

log(X_measured / R) = log(X_deterministic / R) + log(X_stochastic)        (3′)

and, reassigning Y′ = log(X / R),

Y′_measured = Y′_deterministic + Y_stochastic        (4′)

This works for the deterministic component of the measured value.  However, the shape of the distribution of the stochastic component changes under a log transformation, generally smoothing out the right tail of the distribution.
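For instance (again with a made-up lognormal noise term), the long right tail largely disappears after taking the logarithm:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
noise = rng.lognormal(mean=0.0, sigma=0.5, size=200_000)   # right-skewed multiplicative noise

print("skewness before log:", stats.skew(noise))           # strongly positive (long right tail)
print("skewness after log: ", stats.skew(np.log(noise)))   # close to zero (symmetric)
```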

Deterministic Component

The deterministic component is what goes into the mathematical models when we attempt to understand how something works by writing equations that describe the physical behavior of objects.  From Newton’s Second Law to the general theory of relativity, we have been looking for, and finding, the equations that describe the world around us.
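A deterministic model in its simplest form (Newton’s Second Law, with made-up numbers): the same inputs always produce the same output.

```python
# Newton's Second Law, F = m * a, rearranged to predict acceleration.
def acceleration(force_n: float, mass_kg: float) -> float:
    return force_n / mass_kg

# Fully deterministic: identical inputs give identical outputs, every time.
print(acceleration(force_n=10.0, mass_kg=2.0))  # 5.0 m/s^2
print(acceleration(force_n=10.0, mass_kg=2.0))  # 5.0 m/s^2 again
```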

Stochastic Component

However, when we ask ourselves how accurate our equations are, we run into validation issues.  The stochastic component can be confoundingly large, hiding the underlying laws of system behavior behind the guise of randomness (more about this, e.g., in an earlier post in this blog).
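Here is a small sketch of that effect (the underlying “law” y = 2x + 1 and the noise level are made up): the noise dwarfs the signal, so the data looks random, yet the law is still recoverable with enough observations.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up underlying law: y = 2x + 1, buried under noise much larger than the signal range.
x = rng.uniform(0.0, 1.0, 50_000)
y = 2.0 * x + 1.0 + rng.normal(0.0, 10.0, x.size)

print("correlation:", np.corrcoef(x, y)[0, 1])        # close to zero - looks random
print("slope, intercept:", np.polyfit(x, y, deg=1))   # roughly [2, 1] nonetheless
```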

The stochastic component of the observed data point has two elements:
  • Out-of-scope features - variables that are not accounted for in the model or in the data we have.
  • Measurement error - an intrinsic property of the measurement system.  Error in estimating regression parameters also falls under this category.

These two are orthogonal (independent); therefore, their variances simply add:

σ²_stochastic = σ²_out-of-scope + σ²_measurement

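A quick numerical sanity check of that additivity (the two noise sources and their scales below are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# Hypothetical independent noise sources:
out_of_scope = rng.normal(0.0, 3.0, n)   # effect of variables missing from the model
measurement = rng.normal(0.0, 1.5, n)    # intrinsic measurement-system error

stochastic = out_of_scope + measurement

print(np.var(out_of_scope) + np.var(measurement))  # ~ 9.0 + 2.25 = 11.25
print(np.var(stochastic))                          # ~ 11.25 as well, thanks to independence
```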
Because of the first element of the stochastic component, and the Incompleteness Theorems, data measured from a live process will never be purely deterministic.  Therefore, our prediction models can never be 100% accurate.

You Know What To Do

There are only two possible courses of action when it comes to randomness:
  • embrace it and understand it
  • embrace it and account for it in your models.


(To be continued)


