Wednesday, October 27, 2021

It has been a while since I posted here.  A lot has been going on.  I have been doing research on two fronts: at work, as a Research Data Scientist and Tech Lead for capacity and efficiency data science at Facebook, and at Stevens, as part of my doctoral research.

In 2020, I started developing transitive-resource-usage causal models.  Cannot disclose anything here, for obvious reasons.

With amazing colleagues, I presented a queueing-based power model that we developed and put into operation at WWW '21 (The Web Conference 2021, Ljubljana, Slovenia, April 19 - 23, 2021; the conference was virtual, so we did not actually go to Slovenia - maybe next time!).  The paper that became the foundation of the presentation was published in the ACM Digital Library.

At the same time, I made three presentations at INFORMS - 2019, 2020, and 2021 - as part of my Doctoral research at Stevens.  The links to the slides are here:

  • INFORMS 2019: Machine Learning and Data Mining in Identification of Unhappy Communities
  • INFORMS 2020: Using Machine Learning to Identify the Factors of People's Mobility
  • INFORMS 2021: Social Cohesion and Emotion Analysis of Media During 2020 Wildfires: a Case Study

I also passed my Qualifying Exam and officially became a PhD Candidate.  I flew to defend it on campus at Stevens in March 2020.  My flight back home was one day before NYC went into lockdown.

I also published a paper in an academic journal: 

Gilgur, A., Ramirez-Marquez, J.E. Using Deductive Reasoning to Identify Unhappy Communities. Soc Indic Res 152, 581–605 (2020). https://doi.org/10.1007/s11205-020-02452-2


The offprint is available on request.  The INFORMS 2019 talk was on the same topic, but in a more "live" form.

Other than work and research: we (Sophia and I) both got our ASA 104 (Bareboat Cruising) certificates; sailed in the British Virgin Islands (BVI) in December 2019, right before COVID struck; and sailed in Croatia in September 2021.

Cheers!

 


Sunday, November 12, 2017

#CMGimPACt2017 was a blast!  Our paper, "The Curse of P90: An Elegant Way to Overcome It Without Magic", was very well received.  The presentation is uploaded to SlideShare: http://bit.ly/2mk4A0y.  The abstract is below:

Over the decades of development of methodologies and metrics for IT capacity planning and performance analysis, percentile terminology has become the lingua franca of the field. It makes sense: percentiles are easy to interpret, not sensitive to outliers, and directly usable for approximating the distribution of the variable being measured for stochastic simulations.  However, depending on which percentile is used, we can miss important information, like multimodality of the metric’s distribution. Another, less obvious, downside of relying on percentiles comes into play when we size infrastructure for a high percentile of demand (e.g., p90). Given that it takes time to order, manufacture, receive, and install infrastructure, this means that we need to answer the statistically nontrivial question, “what will this percentile of demand be a few years from now?” This paper discusses the issues that arise in answering it and proposes an elegant way of resolving them.
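As a toy illustration (mine, not from the paper) of how a single summary statistic can hide the multimodality the abstract warns about - here a bimodal demand sample summarized by p90 and the mean:

```python
import statistics

# A bimodal workload: a quiet mode around 10 and a busy mode around 100.
demand = [9, 10, 11, 10, 9, 98, 100, 102, 101, 99]

def percentile(data, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    s = sorted(data)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

p90 = percentile(demand, 90)
mean = statistics.mean(demand)

print(p90)   # 101 - drawn entirely from the busy mode
print(mean)  # 54.9 - falls between the two modes, describing neither
```

Neither number reveals that the distribution has two modes; that is exactly the information loss the paper discusses.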
Thank you, CMG, for making the 43rd International Conference such a special event, and for choosing New Orleans as the venue.

Looking forward to more!

PS.  Our #CMGimPACt2016 presentation is in SlideShare as well: http://bit.ly/2zz2crX.

Wednesday, December 2, 2015

Discovering Patterns in Irregular Behavior: Part III

ABSTRACT
Mathematical description of irregular behavior is the holy grail of statistical analysis, akin to a game of croquet that Alice reluctantly played with the Queen of Hearts.  (If you have not read or watched Lewis Carroll’s classic, please refer to a brief description of the game here.)  A lot of uncertainty and ambiguity, just like in the world of data.  In this section, we will discuss one very specific pattern that is critically important in a wide variety of fields, from cardiology to speech recognition, namely seasonality detection.  

A Few Words about Seasonality Definitions

First of all, regardless of what Wikipedia says on its Forecasting page, seasonality has something to do with the four seasons only in some cases.  Or rather, seasonality related to the four seasons is just one very special case of seasonality.  Wikipedia gives the “right” definition on its Seasonality page:

“When there are patterns that repeat over known, fixed periods of time within the data set, it is considered to be seasonality, seasonal variation, periodic variation, or periodic fluctuations. This variation can be either regular or semi-regular.”

This is the seasonality that we are going to discuss here.

Seasonality Detection Methods

The most basic method of seasonality detection is to ask the SME (subject-matter expert).

Most SMEs will tell you that their data have a diurnal (a fancy word for “daily”), a weekly, or an annual pattern.  But few will be able to tell you what’s going on when the data don’t support their claim.

At this point, we might as well abandon hope of getting any useful information from human SMEs and dive into the exciting world of machine learning, which in reality is nothing but no-holds-barred advanced statistics applied, in the seasonality-detection use case, to time series.

What is a Time Series?

A time series is, according to Wikipedia’s article, “a sequence of data points, typically consisting of successive measurements made over a time interval”.  NIST is more precise in its definition: “An ordered sequence of values of a variable at equally spaced time intervals”.  This definition is the only invariant in time-series analysis.  Colloquially, the word “ordered” is sometimes replaced with “time-ordered”.
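The “equally spaced” invariant is easy to check in code.  A minimal sketch (my own, not from the post):

```python
def is_time_series(timestamps):
    """Check the one invariant: observations at equally spaced intervals."""
    gaps = {b - a for a, b in zip(timestamps, timestamps[1:])}
    return len(gaps) == 1

print(is_time_series([0, 300, 600, 900]))  # True: a regular 5-minute grid
print(is_time_series([0, 300, 660, 900]))  # False: one interval is irregular
```

In practice, data that fail this check are usually resampled onto a regular grid before any seasonality analysis.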

Seasonality Detection in Time Series


Why?

The most obvious answer to the "why" question comes from the main use case of time-series analysis: forecasting.  Indeed, predicting the future behavior of a seasonal time series without accounting for seasonality is akin to driving a small square peg into a big round hole: it fits, but the precision of such a model will be poor.

Other use cases include speech recognition, anomaly detection, supply chain management, and stability analysis.

How?

The standard seasonality-detection methods - Fourier analysis, domain expertise, the autocorrelation function, etc. - work fairly well in most scenarios.

However,
  • Domain experts are primarily humans
  • Autocorrelation-based seasonality detection involves thresholding, which is only one step better than domain-expertise-based seasonality analysis
  • Fourier analysis often leads to models that overfit the data

This means we are left with two extremes: too generalized (the human expert) or too specific (Fourier analysis).
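To make the thresholding point concrete, here is a minimal, self-contained sketch (my own, not from the post) of ACF-based seasonality detection with the conventional 95% envelope of ±1.96/√N.  Even a clean daily sine wave lights up almost every lag:

```python
import math

def acf(x, max_lag):
    """Sample autocorrelation function for lags 1..max_lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    out = {}
    for lag in range(1, max_lag + 1):
        cov = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
        out[lag] = cov / var
    return out

def seasonal_lags(x, max_lag):
    """Lags whose ACF falls outside the 95% confidence envelope."""
    envelope = 1.96 / math.sqrt(len(x))
    return [lag for lag, r in acf(x, max_lag).items() if abs(r) > envelope]

# A noiseless daily pattern sampled hourly over two weeks.
series = [math.sin(2 * math.pi * t / 24) for t in range(24 * 14)]
print(seasonal_lags(series, 30))
```

The true period (24 hours) is flagged - but so are most other lags, because the ACF of a periodic signal decays slowly; only lags near the zero crossings of the pattern (e.g., 6 hours) escape the threshold.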

An illustration

Figure 1: Transatlantic traffic data from a European ISP.  Data cover June-July of 2005.  
Data downloaded from this URL on 2015-09-01

The data are flat (trend-free) enough that we can use this data set to illustrate the concepts we are about to introduce here.  To reduce the number of data points, we will aggregate to hourly medians:


Figure 1a: hourly medians
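Since the post's actual aggregation code is not shown, here is a stdlib-only sketch of the hourly-median step, assuming the raw data arrive as (epoch-seconds, value) pairs:

```python
from collections import defaultdict
from statistics import median

def hourly_medians(samples):
    """Collapse (epoch_seconds, value) samples into per-hour medians.

    Returns a list of (hour_start_epoch, median_value) sorted by time.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % 3600].append(value)  # floor to the hour
    return sorted((hour, median(vals)) for hour, vals in buckets.items())

# Toy input: four 15-minute samples in each of two hours.
samples = [(0, 10), (900, 30), (1800, 20), (2700, 40),
           (3600, 5), (4500, 7), (5400, 9), (6300, 11)]
print(hourly_medians(samples))  # [(0, 25.0), (3600, 8.0)]
```

The median (rather than the mean) is the natural choice here because it ignores short spikes within the hour.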
While it seems fairly obvious visually that there is a daily and a weekly pattern in the data, standard methods will not be very useful.  For instance, the AutoCorrelation Function (ACF) returns pretty much the entire spectrum as frequencies:

Figure 1b: ACF
Bars whose absolute values fall outside the dotted envelope traditionally indicate seasonality.  There is a reason behind the default value of 0.025 (it comes from Fisherian statistics: if the ACF is outside the 5% envelope, we cannot say with 95% confidence that the data are not seasonal), but with complex seasonal patterns like the one shown in Figure 1, it is not very meaningful.  Expanding the confidence envelope even to 50% does not add much to our knowledge of these seasonalities.

Looking at Figure 1, we also notice that the upper and lower parts of the data follow very different patterns: the maxima follow a 2+5 (weekend/weekday) pattern, while the minima follow a daily rhythm with a slight weekly modulation.  This is understandable if we recall that these data correspond to traffic through a large Internet Service Provider (ISP) at a very busy time of year, June/July.

Let’s follow through with the ROC (rate-of-change) idea and see where it brings us.

Figure 3: ROC time series plot.
Points following a change of sign in ROC are marked as red dots; all others are open dots.

Figure 4: Hour-to-hour variability in traffic

In Figure 4(b), the points after which the sign of the first derivative changes from positive to negative correspond to the local maxima of the hourly-median time series; they are marked as filled blue dots.  Points where the second derivative changes sign are inflection points; these are marked as boxes.
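Since the post does not show code, here is one way to sketch this derivative bookkeeping (my own reconstruction), using first differences as the discrete first derivative and differences of differences as the second:

```python
def sign(x):
    return (x > 0) - (x < 0)

def classify_points(y):
    """Return indices of local maxima, local minima, and inflection points.

    A local max is where the first difference changes sign + to -,
    a local min is where it changes - to +, and an inflection point
    is where the second difference changes sign.
    """
    d1 = [b - a for a, b in zip(y, y[1:])]    # discrete first derivative
    d2 = [b - a for a, b in zip(d1, d1[1:])]  # discrete second derivative
    maxima, minima, inflections = [], [], []
    for i in range(1, len(d1)):
        if sign(d1[i - 1]) > 0 and sign(d1[i]) < 0:
            maxima.append(i)
        elif sign(d1[i - 1]) < 0 and sign(d1[i]) > 0:
            minima.append(i)
    for i in range(1, len(d2)):
        if sign(d2[i - 1]) != sign(d2[i]) and sign(d2[i]) != 0:
            inflections.append(i + 1)  # shift back toward y's indexing
    return maxima, minima, inflections

y = [0, 2, 4, 3, 1, 0, 2, 5]
maxima, minima, _ = classify_points(y)
print(maxima, minima)  # [2] [5] - the peak at y[2]=4 and the valley at y[5]=0
```

This needs nothing beyond the high-school calculus the conclusion promises: a change of sign in the first difference marks an extremum.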

Let us look at the distribution of the time intervals between the sign-change points.
We see that the density plot, too, has multiple local maxima.


(a): Distribution of intervals between peaks (seconds)
(b): Distribution of intervals between valleys (seconds)

Figure 7: Density of time intervals between local maxima and minima of hourly traffic medians.

We see from Figure 7 (a & b) that peaks and valleys of the hourly traffic medians follow the same pattern.  Table 1 verifies this.  The peaks are observed at 3, 23, 47, and 56 hours, and so are the valleys.
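The interval analysis can be sketched as follows (the peak hours below are made up for illustration; the post's real peaks sit at 3, 23, 47, and 56 hours):

```python
from collections import Counter

def interval_histogram(event_indices):
    """Histogram of gaps between consecutive events (e.g., peak hours)."""
    gaps = [b - a for a, b in zip(event_indices, event_indices[1:])]
    return Counter(gaps)

# Hypothetical peak hours with a dominant daily (24 h) rhythm
# plus an occasional longer weekend gap.
peaks = [3, 27, 51, 75, 123, 147, 171, 195, 243]
hist = interval_histogram(peaks)
print(hist.most_common())  # [(24, 6), (48, 2)]
```

The modes of this histogram are the candidate seasonal periods; clustering nearby gap values (to absorb jitter) is the only extra step needed on real data.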


The complex interactions and interference of these periodic components, not unlike the Wave Pendulum, produce the pattern we saw in Figure 1.

Conclusion
We have demonstrated a reliable method of detecting complex seasonal patterns in time-series data.  This method requires no Fourier decomposition, regression fitting, neural networks, or other overkill techniques.  All that is needed is high-school calculus, probability and statistics, a fundamental understanding of clustering, and a little common sense.

Saturday, November 14, 2015

CMG2015 Performance and Capacity International Conference

The #CMG2015 conference was an overwhelming success.  I could not find a paper that was not interesting, and all presentations were engaging, enlightening, and deep.  Not only was it a breath of fresh air - it always is! - but it was also a great opportunity to learn and to network with the world's best IT performance and capacity experts.

The level of academic rigor and practical expertise at CMG conferences is high and growing.  It is not a trade show, and it is not a purely academic event, but something unique.  I really hope it continues in this direction: that way, CMG will grow in numbers while upholding its reputation as the only organization of IT professionals where academic knowledge and technical expertise unite to form the unique alloy that keeps us coming back to CMG conferences again and again.

Our paper, "Percentile-Based Approach to Forecasting Workload Growth", got a lot of good attention too, and I received the prestigious Mullen Award for presenting it!

Thank you, CMG!  I really hope to see you all in La Jolla at CMG2016!

Friday, October 9, 2015

"Percentile-Based Approach to Forecasting Workload Growth"

That is the title of our paper, which has been accepted to the Performance and Capacity (CMG'15) International Conference taking place in San Antonio the first week of November.  If you were a Conference referee who let it happen, THANK YOU!!!

If you are going to the Conference, I am presenting the paper on Thursday, November 5, at 10:30 AM in Naylor (Session 525).  I'll be happy to see you there!