A very special case of a non-normal distribution is the Poisson distribution. I call it special because it can be used in analyzing data for seasonality. (By convention, any periodic pattern is called "seasonal". The reasons for that are shrouded in mystery. I honestly tried to understand them, but failed; so I just joined the crowd that says that seasonal patterns don't have to follow the annual cycle, and left it at that.)
In an ideal world, if there is a 7-day (or a 3-month, or a 4-week) pattern in the data, we can count on this pattern repeating consistently. Moon phases (and consequently the tidal movements of water) would be a good example of that. For cases like this, Fourier analysis works wonders: it lets us pick out the precise seasonalities in the data, rank them by magnitude (amplitude), and keep the most significant ones (more on significance later in this post).
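To make the Fourier step concrete, here is a minimal sketch in Python. It assumes evenly spaced samples with no gaps, and it uses a plain discrete Fourier transform from the standard library so that it runs anywhere; for real data an FFT routine such as numpy.fft.rfft would be the usual choice. The function name dominant_periods and the synthetic series are made up for illustration.

```python
import cmath
import math


def dominant_periods(series, top=3):
    """Return the `top` (period, amplitude) pairs, largest amplitude first."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # drop the constant component
    results = []
    # Skip frequency 0 and (for even n) the Nyquist frequency.
    for k in range(1, (n - 1) // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        amplitude = 2 * abs(coeff) / n  # amplitude of the sinusoid at this frequency
        results.append((n / k, amplitude))  # period in sample units
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results[:top]


# Synthetic data (an assumption for this example): 84 daily points with an
# exact 7-day cycle of amplitude 3 and a weaker 28-day cycle of amplitude 1.
series = [
    10 + 3 * math.sin(2 * math.pi * t / 7) + math.sin(2 * math.pi * t / 28)
    for t in range(84)
]

for period, amplitude in dominant_periods(series, top=2):
    print(f"period = {period:.1f} days, amplitude = {amplitude:.2f}")
```

With a perfectly regular series like this one, the 7-day and 28-day cycles come out on top with their true amplitudes. Once the events start drifting by a day or two, the peaks smear across neighboring frequencies.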
But unless we are in a business related to open-sea fishery, offshore oil drilling, lighthouse operations, or any other where lunar phases are important, most of the time we do not have such a clear-cut seasonal pattern: a maintenance event may be off by a day; periodic deliveries may not always happen on time; task processing may take longer than it usually does; etc.
In these cases, we see outliers (events) following an approximately periodic pattern, with clusters of data between the outliers. We can view this problem from two sides:
1. We can analyze the intervals (sizes of these clusters) and, if they follow an Erlang distribution, we may be able to predict the amount of traffic based on queueing theory. The Erlang approach, however, is very useful but not very simple to apply in data analysis, as the correct formula depends on what is done to the blocked calls (packets): they may be aborted or queued. For more on the Erlang distribution, see http://en.wikipedia.org/wiki/Erlang_distribution
In addition, not all situations can be reduced to a queueing theory application, and sometimes a more generic approach is needed.
2. We can look at the problem from the other end and sort the cluster sizes, with the assumption that some fuzziness is expected: e.g., a cluster size of 5-8 days is weekly, whereas 26-35 days is monthly, etc. That done, we can analyze the numbers of outliers (events) that occurred during these intervals. If the numbers of events for the most 'prominent' period (the mode of the cluster size distribution) follow the Poisson distribution, then we have a seasonality. This seasonality can then be used as a parameter in forecasting the data.
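The second approach can be sketched roughly as follows. This is one possible reading of the steps, not a definitive implementation: the bucket boundaries come from the text above, but the nominal period lengths (7 and 30 days), the function names, and the sample event days are assumptions. The Poisson check used here is the index of dispersion (variance divided by mean, which is 1 for a Poisson distribution); it is a quick screen that I am swapping in for a formal goodness-of-fit test.

```python
from collections import Counter
from statistics import mean, variance

# Fuzzy buckets: name -> (shortest gap, longest gap, nominal period in days).
# The ranges follow the text; the nominal lengths are assumptions.
BUCKETS = {"weekly": (5, 8, 7), "monthly": (26, 35, 30)}


def classify(gap):
    """Map a gap between two events to a bucket name, or None if it fits none."""
    for name, (low, high, _nominal) in BUCKETS.items():
        if low <= gap <= high:
            return name
    return None


def modal_period(event_days):
    """Return the bucket name that occurs most often among the gaps."""
    days = sorted(event_days)
    gaps = [later - earlier for earlier, later in zip(days, days[1:])]
    labels = [label for label in map(classify, gaps) if label is not None]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


def counts_per_window(event_days, window):
    """Count events in consecutive windows of `window` days.

    The last window may be only partly covered by the data.
    """
    start, end = min(event_days), max(event_days)
    counts = [0] * ((end - start) // window + 1)
    for day in event_days:
        counts[(day - start) // window] += 1
    return counts


def dispersion_index(counts):
    """Sample variance over mean; values near 1 are consistent with Poisson."""
    return variance(counts) / mean(counts)


# Hypothetical event days (days since the start of observation).
event_days = [0, 7, 13, 21, 28, 34, 42, 48, 56, 63, 70]

period = modal_period(event_days)
counts = counts_per_window(event_days, BUCKETS[period][2])
print(period, counts, round(dispersion_index(counts), 2))
```

How close to 1 the index has to be before calling the counts Poisson is a judgment call that depends on the number of windows; with only a handful of windows the estimate is very noisy.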
Great post!
What about the case of a financial trade-results time series,
in which there is a sequence of trade results (wins and losses),
and I am trying to model the distribution of losses after wins (assuming a low win rate of around 40%)?
Sometimes there is a win after a win (0 losses), sometimes 1 loss after a win, sometimes 2 losses, up to even some extreme infrequent outliers of 20 losses after wins.
I have modeled this and it looks like a Poisson distribution as you mention above.
Would the solution be to take the mode, i.e. the most frequent loss result (let's say the mode is 4)?
Since it is a weakly stationary time series, how many loss-periods (0, 1, 2, 3, ..., 20) should I sample to keep recent, relevant results that have not drifted too far from the stationary mean?
My estimate is at least 5 and at most 20 loss-periods, with around 10 perhaps being optimal.
Thanks for your insights!
Thanks, Daniel! You've raised a very interesting question, to which I finally found an answer some time ago and am going to post on this blog in a couple of weeks (or less).