Unsatisfied with my previous post's Advanced Analytics definition, and after giving some thought to what counts as an advanced method in analytics, I realized that the analytics industry is missing a good analytics pattern catalog: a list of common problems followed by a list of common, industry-consensus solutions to them. An equivalent of the GoF design patterns for analytics. Each list item would start with a brief description of a common, recurring analytics problem, elaborate on the commonly accepted solutions to it, and close with a mandatory example section illustrating the solution using widely available tools.
Software engineers borrowed this idea from the real architects (those dealing with concrete structures, not abstract ones) some 15 years ago. They didn't avoid the initial short period of mass obsession with and abuse of the concept… who does? But eventually it worked out quite well for them. I wonder if the analytics industry could leverage that experience and create a catalog of some 25-50 of the most common patterns. Keeping each pattern description to a few pages and the number of patterns to a few tens would make wide industry adoption feasible.
What do you think? Any ideas? I'll try to make a first step by dumping patterns from my head right now (this is by no means a finished work):
I'll call them analytics patterns:
1. Predictive Analytics. That was the easiest for me: I was involved in it for the first time some 12 years ago, developing what is now http://www.oracle.com/demantra/index.html. The system was used mostly to forecast sales, taking into account an array of causal factors like seasonality, marketing campaigns, and historical growth rates. The problem: a lot of time-based historical data is available, and future values must be forecast in the context of that historical data. The basic mechanism of Predictive Analytics is to find (or, less preferably, to develop) a suitable mathematical model that closely fits the existing data, usually a time series (but be cautious about overfitting), and then use the model to induce the forecasted values. In simple terms it is a case of extrapolation; correct me if I'm wrong. As was the case in the 90s, and I'm pretty sure is still the case now, exotic hardcore AI approaches like neural networks and genetic programming are best kept exclusively for moonlighting experiments and as material for water-cooler conversation the next morning. With deadlines defined and a limited budget, it is best to stick to proven techniques and achieve quick wins. I think the value of a working forecast is self-evident.
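To make the "fit a model, then extrapolate" idea concrete, here is a minimal sketch with toy, perfectly linear sales figures (the numbers are invented): fit a linear trend by ordinary least squares and project it forward. A real system like the one described above would add seasonality, causal factors, and overfitting safeguards.

```python
def fit_linear_trend(series):
    """Least-squares fit of y = a + b*t over t = 0..n-1."""
    n = len(series)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(series) / n
    b = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, series)) \
        / sum((t - t_mean) ** 2 for t in ts)
    a = y_mean - b * t_mean
    return a, b

def forecast(series, horizon):
    """Extrapolate the fitted trend `horizon` steps past the data."""
    a, b = fit_linear_trend(series)
    n = len(series)
    return [a + b * t for t in range(n, n + horizon)]

sales = [100, 110, 120, 130, 140, 150]   # toy monthly sales, slope 10
print(forecast(sales, 2))                # -> [160.0, 170.0]
```

Extrapolation is exactly what this is: the model is trusted outside the range it was fitted on, which is why model choice and validation matter so much here.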
2. Clustering. Not the heavy, noisy kind in a cold server hall, but the statistics sub-discipline better called cluster analysis. The problem: a lot of high-dimensionality data is available, and groups of similar observations must be discovered; in other words, the observations must be classified automatically. It is implemented by searching for correlations and grouping the records according to the discovered correlations. What is it good for? In simple terms, it helps to discriminate between different kinds of objects and to observe the specific properties of each kind. Without such grouping, one would only be able to observe properties that all objects exhibit, or alternatively go object by object and observe each in isolation.
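A minimal sketch of one classic cluster-analysis algorithm, k-means, on invented two-dimensional data with two obvious groups. The seeding here is deterministic to keep the example reproducible; real k-means implementations use random restarts or k-means++.

```python
def kmeans(points, k, iters=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its cluster, and repeat."""
    # Deterministic seeding for this sketch (spread over the input);
    # assumes k >= 2. Production code would use random (re)starts.
    centroids = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
                     else centroids[i] for i, cl in enumerate(clusters)]
    return centroids, clusters

pts = [(1, 1), (1.2, 0.8), (0.9, 1.1),     # one tight group
       (8, 8), (8.1, 7.9), (7.9, 8.2)]     # another tight group
centroids, clusters = kmeans(pts, 2)
print(centroids)   # one centroid near (1, 1), the other near (8, 8)
```

Once the groups exist, each can be profiled separately, which is exactly the "observe the specific properties of each kind" payoff described above.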
3. Risk Analysis, particularly through Monte Carlo simulation. It is not called Monte Carlo because it was invented there; it is called so because of its reliance on random numbers, akin to Monte Carlo's casinos. Random numbers have proved to be the most effective way to simulate a mathematical model with a large number of free variables. With the advent of computers it became a whole lot easier than pulling them from a printed book of random numbers.
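A minimal Monte Carlo risk-analysis sketch, with made-up numbers: each project task has an uncertain cost (modelled here with triangular distributions), and we estimate the probability that the total blows the budget by sampling many scenarios.

```python
import random

def simulate_over_budget(n_trials=100_000, budget=130.0, seed=42):
    """Estimate P(total cost > budget) when each task cost is a random
    draw; task names and distributions are invented for illustration."""
    rng = random.Random(seed)
    over = 0
    for _ in range(n_trials):
        design = rng.triangular(20, 40, 30)   # (low, high, mode)
        build = rng.triangular(50, 90, 60)
        test = rng.triangular(10, 30, 15)
        if design + build + test > budget:
            over += 1
    return over / n_trials

print(f"P(cost > budget) ~ {simulate_over_budget():.2f}")
```

The point of the technique is the output distribution, not a single number: the same loop can record every sampled total and report percentiles instead of one probability.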
4. Given a telecom event stream, run the events through a rules engine to detect and prevent telecom fraud in real time. This is essentially a CEP engine, usually implemented by creating a state machine per rule and running the events through it. A special version of stream SQL is used to define the rules. A similar scheme can be used for real-time click-fraud prevention.
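A toy illustration of the state-machine-per-rule idea (the rule and its thresholds are invented): a velocity rule that keeps a sliding window of timestamps per caller and fires when too many calls land inside the window. A real CEP engine would compile many such rules from a stream-SQL dialect.

```python
from collections import defaultdict, deque

class VelocityRule:
    """One small state machine per caller: alert when a caller places
    more than `limit` calls within `window` seconds."""

    def __init__(self, limit=3, window=60):
        self.limit = limit
        self.window = window
        self.history = defaultdict(deque)   # caller -> recent timestamps

    def on_event(self, caller, ts):
        q = self.history[caller]
        q.append(ts)
        while q and q[0] <= ts - self.window:   # expire old calls
            q.popleft()
        return len(q) > self.limit              # True => suspicious burst

rule = VelocityRule(limit=3, window=60)
events = [("555-0100", t) for t in (0, 10, 20, 30)]
alerts = [rule.on_event(caller, t) for caller, t in events]
print(alerts)   # the 4th call inside 60 seconds trips the rule
```

The per-key state (here a deque per caller) is what makes this "real time": each event is processed in isolation against a small amount of retained state, never against the full history.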
5. Given serialized object data or other nested data, allow running ad-hoc interactive queries over it, BigQuery-fashion.
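A minimal sketch of what querying nested records ad hoc looks like: the data stays nested, and a repeated field is flattened on the fly for the query, in the spirit of BigQuery's FLATTEN. The records and field names are invented.

```python
orders = [
    {"id": 1, "customer": "acme",
     "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"id": 2, "customer": "zenith",
     "items": [{"sku": "A", "qty": 5}]},
]

def flatten(records, repeated_field):
    """Yield one flat row per element of the repeated field."""
    for rec in records:
        for child in rec[repeated_field]:
            row = {k: v for k, v in rec.items() if k != repeated_field}
            row.update(child)
            yield row

# Roughly: SELECT customer, qty FROM orders FLATTEN(items) WHERE sku = 'A'
result = [(r["customer"], r["qty"])
          for r in flatten(orders, "items") if r["sku"] == "A"]
print(result)   # -> [('acme', 2), ('zenith', 5)]
```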
6. Given a normalized relational model, allow running any ad-hoc queries. For common joins, create a materialized view to speed them up.
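A sketch of the "materialize the common join" idea using SQLite from Python. SQLite has no MATERIALIZED VIEW statement, so a plain table built with CREATE TABLE … AS SELECT stands in for one here; the schema is invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'acme'), (2, 'zenith');
    INSERT INTO orders VALUES (10, 1, 99.5), (11, 1, 20.0), (12, 2, 5.0);

    -- The frequently used join, precomputed once:
    CREATE TABLE mv_customer_orders AS
        SELECT c.name, o.id AS order_id, o.total
        FROM customers c JOIN orders o ON o.customer_id = c.id;
""")

# Ad-hoc queries now hit the denormalized table instead of re-joining:
rows = db.execute(
    "SELECT name, SUM(total) FROM mv_customer_orders "
    "GROUP BY name ORDER BY name"
).fetchall()
print(rows)   # -> [('acme', 119.5), ('zenith', 5.0)]
```

The trade-off is the usual one: faster reads in exchange for refresh logic, since the materialized table goes stale when the base tables change.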
7. Canned reports. I guess they are good for some cases, too…
8. OLAP/star schema: when to use it? …
Of course this is just a first step, and to do it correctly would be a project in itself, most probably in the form of a book. However, as the Chinese proverb goes, "A journey of a thousand miles begins with a single step."