By Camuel Gilyadov, on October 8th, 2010

CAP equivalent for analytics?

The CAP theorem deals with trade-offs in transactional systems. It needs no introduction, unless of course you have been busy on the moon for the last couple of years; in that case you can easily Google for good intros. Here is the Wikipedia entry on the subject.

I was thinking about how I would build an ideal analytics system. The realization quickly came that all the “care abouts” cannot be satisfied simultaneously, even assuming enough time for development. Some desirable properties must be sacrificed in favor of others, hence architectural trade-offs are unavoidable in principle. I immediately had déjà vu regarding CAP. So the following is my take on the subject:

SVLC hypothesis regarding architectural trade-offs in analytics

I haven’t come to a rigorous definition yet, so here is an intuitive one: current technology doesn’t allow implementation of a single analytics system that is SVLC, meaning simultaneously sophisticated, high-volume, low-latency and low-cost. One of these four properties must be sacrificed, and the extent to which it is sacrificed determines the extent to which the other properties can be implemented.

Deep dive for the brave souls

Let’s reiterate the desired system properties first (see ideal analytics system):

  1. Deep Sophistication => …free-form SQL:2008 with multi-way joins of two or more big tables, sorts of big tables and all the rest of the data heavy lifting.
  2. High Volume => …handling big data volumes. Let’s cap it at 1PB for now, for easier thinking.
  3. Low Latency => …subsecond response time per query on average. More concretely, latency must be low enough to allow an analyst to work with the system interactively, in a conversational manner.
  4. Low Cost => …I’ll define it as commodity hardware, with software costs not exceeding hardware costs. More rigorously? $1/GB/month for actively queried data is my very rough estimate for low-cost.
  5. Multi-form => any data: relational, serialized objects, text, etc.
  6. Security => speaks for itself

I found that multi-formness and security don’t interfere with implementing the rest of the properties and can in principle always be implemented in a satisfactory way without major compromises. Some nuances exist, though, but I’ll ignore them for clarity. Removing them, we get the following list:

1. Sophistication (deep) => S

2. Volume (high) => V

3. Latency (low) => L

4. Cost (low) => C

These four are highly inter-related and form a constraint system. Implementing one to its full extent hampers the rest. Let’s see what trade-offs we have here. Four properties give six potential simple two-extreme trade-offs. Let’s settle on a geometric tetrahedron to model the architectural trade-off space: the four properties correspond to the four vertices and the six trade-offs correspond to the six edges. A particular trade-off is then modeled by putting a point on the corresponding edge. So we get something like this:
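The tetrahedron is easier to draw than to describe, but its structure is tiny; here is a minimal Python sketch (labels only, no geometry) that enumerates the four vertices and the six pairwise trade-off edges:

```python
from itertools import combinations

# The four SVLC properties sit at the tetrahedron's vertices.
properties = ["Sophistication", "Volume", "Latency", "Cost"]

# Every unordered pair of vertices is an edge, i.e. a pairwise trade-off.
edges = list(combinations(properties, 2))

print(f"{len(properties)} vertices, {len(edges)} trade-off edges:")
for a, b in edges:
    print(f"  {a} <-> {b}")
```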

Okay, so far so good. Now I’ll try to be a devil’s advocate and challenge my own point that any trade-offs are necessary in the first place. So let’s review the system denoted as

SVLC => high-volume, low-latency, deep analytics, low cost

Because it is low-latency, it will need I/O throughput adequate to scan the whole dataset quickly, and since it is high-volume (see above for the quantitative definition), meaning the aforementioned dataset is big, it will need a large number of individual nodes in the cluster to provide the required aggregate I/O throughput. The number of machines is further increased by the low-cost requirement, meaning that simpler servers in the mainstream sweet spot must be purchased. The system therefore becomes extremely distributed, with data dispersed all over it. Low-cost networking usually means TCP/IP, which is high-overhead, high-latency and low-throughput. Deep Sophistication requires performing complex data-intensive operations, like full sorts of big datasets, joins of big tables, or even a simple SELECT DISTINCT over big data, that will inevitably have long latencies. Once latency is long enough, the probability of a node failing mid-query becomes non-negligible. The latency increase then becomes self-perpetuating because of the finer-grained intermediate result materialization it requires. This materialization is needed to prevent never-ending query restarts and to provide a kind of resumable query, and no other solution to resumable queries is documented except MapReduce-style intermediate result materialization. This ultimately makes latency batch-class long, violating the low-latency requirement.
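A quick back-of-the-envelope calculation illustrates the squeeze. The constants below (per-node scan throughput, per-node mean time between failures) are my own assumptions for a commodity node, not measurements, so treat the output as an order-of-magnitude sketch only:

```python
# Rough feasibility check for SVLC: scan 1 PB within a latency budget
# using commodity nodes. All constants are assumptions, not measurements.

DATASET_BYTES = 1e15          # 1 PB, the "high volume" cap from above
LATENCY_BUDGET_S = 1.0        # "low latency": subsecond full scan
NODE_SCAN_BYTES_PER_S = 1e9   # ~1 GB/s per commodity node (disk + bus)
NODE_MTBF_HOURS = 30_000      # assumed mean time between failures per node

nodes_needed = DATASET_BYTES / (NODE_SCAN_BYTES_PER_S * LATENCY_BUDGET_S)

# With that many nodes, how often does *some* node in the cluster fail?
cluster_mtbf_minutes = NODE_MTBF_HOURS * 60 / nodes_needed

print(f"Nodes needed for a {LATENCY_BUDGET_S:.0f}s full scan: {nodes_needed:,.0f}")
print(f"Expected time between node failures in that cluster: "
      f"{cluster_mtbf_minutes:.1f} minutes")
```

With these assumptions you need on the order of a million nodes, and such a cluster loses a node roughly every two minutes, which is exactly why queries must be made resumable.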

I guess my proof misses the rigor required to be taken seriously by academics; I’m just an engineer :) I’d love to see it reworked into something more serious, though. I just hope to get the point across and to be of value to engineers and practicing architects.

Anyhow, this is the basis of my hypothesis: it is impossible to achieve full SVLC using today’s technology.

Let’s consider the other cases, where we give up something. It is easy to visualize such a trade-off as a 2D plane dissecting the tetrahedron; the three points where three edges are cut correspond to three trade-offs. For simplicity I’ll elaborate only on radical trade-offs in this post. Radical trade-offs are those where on all six trade-off edges one extreme is selected, which corresponds to putting the trade-off point on one of the vertices. Most real-world systems make temperate trade-offs, which correspond to a plane that dissects the tetrahedron into sizable parts. Moreover, real-world systems, especially those available from commercial vendors, are toolbox kinds of systems: the system consists of a set of tools, each of which makes a different set of trade-offs, and it is up to the engineer to choose the right tool for the job from the toolbox. However, the toolbox approach is not a loophole for this hypothesis, because the properties of the different tools don’t add up in the desired way. For example, the simultaneous use of an expensive tool and a low-cost tool is still expensive; the simultaneous use of low-latency and high-latency is high-latency. Nevertheless, the toolbox approach is the best one for real-world problems, because real-world problems are usually decomposed into a number of sub-problems, each of which may require a different tool.

Well… back to the radical systems… Let’s consider all four cases where we completely give up one property to max out the remaining three:


SVL => high-volume, low-latency, deep analytics …giving up Cost… seems to be implementable. In its pure form it resembles a classic national-security-style system: subsecond querying of a petabyte-scale dataset with arbitrary joins. A heavily over-provisioned Netezza / Exadata / Greenplum / Aster or other MPP system could do it, I believe. Data is kept in huge RAM or on flash, huge I/O is available to scan the whole dataset in a matter of seconds, and high-speed, low-overhead networking with huge bisection bandwidth is available, capable of shuffling the whole dataset in a matter of seconds. InfiniBand/RDMA are probably the best fit. How bad can Cost be here? Well… unhealthy to imagine. Throw some numbers in anyway? I’ll do some back-of-envelope calculations in future posts.
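Just to hint at the order of magnitude, here is a tiny sketch of the memory bill alone for keeping the full dataset in DRAM. The $/GB figure is my own placeholder assumption, not a quote, and the sketch ignores servers, networking, power and licenses entirely:

```python
# Rough lower bound on the DRAM bill alone for an SVL system.
# The price per GB is an assumed illustrative figure, not a quoted one.

DATASET_GB = 1_000_000      # 1 PB of actively queried data
DRAM_PRICE_PER_GB = 30.0    # assumed server DRAM price, USD
RAM_PER_NODE_GB = 64        # common commodity server of the era

ram_cost = DATASET_GB * DRAM_PRICE_PER_GB
nodes_for_ram = DATASET_GB / RAM_PER_NODE_GB

print(f"DRAM alone: ${ram_cost:,.0f}")            # tens of millions of dollars
print(f"Nodes just to hold it in RAM: {nodes_for_ram:,.0f}")
```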

SVC => high-volume, low-cost, deep analytics …giving up Latency… seems to be implementable; in fact this is MapReduce territory, Hadoop’s natural habitat. Are ETL systems SVC? I think not, because while they give up Latency, they don’t keep high Volume either. How bad is Latency? Well… forget interactivity; create a queuing system and get notified when the job is done. If it’s too slow, add servers. If some interactive experimentation is needed, use VLC first to develop and prove your hypothesis and only then crunch the data with SVC. Since cost is involved, I guess Hadoop MapReduce is really the king here. Though if Aster licenses, for example, are comparable to the overall commodity cluster cost and not many multiples of it, then it could fit the category nicely. Otherwise it would make a suboptimal (in the context of my model, not in a wider sense!) yet great SV system. The great MapReduce debate is not for nothing!

SLC => low-latency, low-cost, deep analytics …giving up Volume… seems to be implementable in a minute: just start your favorite spreadsheet application ;) You would be shocked how much data Excel crunches in just a few seconds nowadays. Most traditional BI tools are in this category too. Heck, if not for Big Data, the analytics industry would have become as exciting as enterprise payroll systems. Though innovation is possible even there. Heck, 99% of BI is fully feasible to do completely in-memory, often on a single server, and the deployment can be really low-risk, low-cost and very rapid if done correctly. Most cloud BI vendors are also in this category. The “R-project” is here too. This was Kickfire’s beloved spot, as it now is for QlikTech, GoodData, PivotLink, etc. So pretty much all BI vendors are here except the MPP heavy-lifters. How badly is Volume limited? Well, with CPU-DRAM bandwidth at 50GB/sec and DRAM sizes of 64GB on common commodity servers, I think crunching a few tens of GB should be well possible in a matter of seconds, if not for implementation sloppiness, and for literally pocket money (an average enterprise’s pocket, not mine… yet).
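As a sanity check on that claim, here is the arithmetic spelled out; the bandwidth and dataset figures are the same assumed numbers as in the paragraph above, and the pass count is an illustrative guess:

```python
# How long does a pure in-memory scan of a "small data" working set take?
# Figures are the assumptions stated above; real results depend on the workload.

MEM_BANDWIDTH_GB_S = 50     # assumed CPU-DRAM bandwidth of a commodity server
DATASET_GB = 40             # "a few tens of GB", fits in 64GB of DRAM
PASSES = 3                  # assume a query needs a few passes over the data

seconds = DATASET_GB * PASSES / MEM_BANDWIDTH_GB_S
print(f"{PASSES} passes over {DATASET_GB} GB: ~{seconds:.1f} s")   # ~2.4 s
```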

VLC => high-volume, low-latency, low-cost …giving up Sophistication… seems to be implementable: that is, doing a simple scan and giving up Sophistication, particularly joins. Dremel and BigQuery seem to follow this approach. How bad is giving up Sophistication? Well, it all depends on how pre-joined/nested the dataset is. With normalized schemas, well… the unavailability of joins makes it pretty much impractical to implement any usable analytics. However, with a star schema and particularly with nested data (with some extensive pre-joins, even if that means some redundancy), this can work wonders for the vast majority of queries, completing them in seconds even on large datasets. However, no pre-join strategy will work for 100% of queries, and functions like COUNT DISTINCT must be approximated when run over big datasets, as described in the Dremel paper. I’ll also assign sampling strategies to this category, because sophistication here also means accuracy. One clarification: only joins of several big tables are sacrificed here; joins of a big table with even a large number of small tables are perfectly okay and are done on the fly during the scan. Sorting a big table before it has been reduced to a manageable size is also sacrificed in this approach, but approximation algorithms can be used for that too, and then it is also okay.
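To show what “approximated” can mean in practice, here is a minimal sketch of one such technique, a K-minimum-values estimator for approximate COUNT DISTINCT. It only illustrates the idea of trading exactness for bounded memory; it is not the specific algorithm the Dremel paper uses:

```python
import hashlib
import heapq

def approx_count_distinct(items, k=1024):
    """K-minimum-values estimate of COUNT DISTINCT over a stream.

    Keeps only the k smallest distinct hash values (bounded memory),
    then estimates cardinality as (k - 1) / kth_smallest_hash.
    """
    heap = []        # max-heap via negation; holds the k smallest hashes seen
    kept = set()     # the same hashes, for O(1) duplicate checks
    for item in items:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        x = h / 2**64                      # normalize hash into [0, 1)
        if x in kept:
            continue                       # duplicate value, already counted
        if len(heap) < k:
            heapq.heappush(heap, -x)
            kept.add(x)
        elif x < -heap[0]:                 # smaller than current kth-smallest
            evicted = -heapq.heappushpop(heap, -x)
            kept.discard(evicted)
            kept.add(x)
    if len(heap) < k:                      # fewer than k distinct values: exact
        return len(heap)
    return int((k - 1) / -heap[0])

# Example: a few percent relative error is typical for k=1024.
# True distinct count here is 100,000.
print(approx_count_distinct(i % 100_000 for i in range(1_000_000)))
```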

Hence the conclusion: only 3 of the 4 SVLC properties can be implemented to their full extent in a single analytics system. The hypothesis goes that any attempt that allegedly violates it is in fact either not a single system, or has a latent impairment in one or more of the properties.

The extended hypothesis for fractional cases:

  • Systems/trade-offs may be radical or temperate. A radical trade-off completely gives up one of the four properties of the system. A temperate trade-off gives up a property only fractionally, at the expense of giving up the other properties fractionally as well.
  • Most real-world systems are complex: they are a set of tools, where each tool is a concrete trade-off, and the user can apply different tools with different trade-offs sequentially or simultaneously. As argued above, this is not a way out of the constraint, because the properties of the separate tools don’t add up in the desired way, even though the toolbox approach remains the best one for real-world problems.
  • Most often, the trade-offs of real-world systems are temperate.


5 comments to CAP equivalent for analytics?

  • I like to quantify things during engineering tasks. It makes knowledge and experience much easier to apply in practice.
    While sophistication can only be categorized, the rest can be measured in gigabytes, seconds and dollars. And after assembling a repository of measurements of known analytical database systems, we would be able to know the real cost of reducing latency or increasing volume. In my opinion such research would be a very good development of this article.

    • Camuel Gilyadov

      Well… some rudimentary numbers are provided there: High-Volume = 1TB, Low-Latency = subsecond, Low-Cost = $1/GB/month. Sophistication is also defined: either a big-table join, a big-table sort, or a big-table SELECT DISTINCT, where “big table” means it doesn’t fit in DRAM. You may also want to look into the related post this one has grown out of. However, a more rigorous quantitative approach would be great, no doubt about it. As I find time I’ll run the numbers through Excel and publish my findings.

      Thanks for the comment!

  • Jordi Ferran

    On the CAP theorem, the only real headache is consistency across all nodes. Nonetheless the data can be streamed, so in a sense it is always available (with a speed penalty). In this analytics scenario I believe it is much more complex, and two elements are key.

    The high volume issue simply requires a divide-and-conquer strategy. With cloud computing in mind, this must be addressed using a hundred cheap computers. The data can be evenly distributed.

    The deep sophistication is the real challenge. I believe the data exploitation is simple, and must be simple (joins strictly forbidden). For instance, the use of a join on big tables is nonsense (for me), because a worst-case scenario, which is possible, makes the solution not viable. I believe it is a big mistake to think of the data structure as it is, because this is information no one needs to know. The analyst works with a data model, an ontology model. It provides the schema for developing the questions.

    It is the storage system that assists in translating the questions into low-level data queries. This does not mean executing one big question/request/query, but analyzing how to properly answer it in the future. If no good solution is available, then it is a problem that needs to be fixed. How? Several options:
    - the easy one is through data mutation,
    - the worst-case scenario is to accept that the system gives up: the desired real-time sub-second response is not possible. The alternative, nonetheless, is a batch process that greatly minimizes the impact through several queries. This is complex: heuristics, memory of previous experience, incremental search, small queries using the whole power of the cloud, and later collecting and analyzing the results.

    Technically speaking, the implementation of the storage needs to be delegated, and this structure needs to mutate. A relational database management system seems painful to me. Why store the data in rows? Why not all column values together (column-based)? Why choose between one or the other? Nothing is perfect, so everything needs to evolve, to mutate, having both combined, or using a new one (NoSQL).

    The complexity is in the deep sophistication feature before the data layer, in a previous layer that needs to be developed. It is there to assist in the analysis process, and to create from scratch the perfect code or data structure (subdivision of problems and solutions).

    For me the real challenge, even as a theoretical problem, is how to seamlessly migrate source code developed over 20 years. A source code analyzer is interesting, but humans are so stupid that the analyzer logic may go crazy (into a deadlock) trying to understand code that is simply buggy.

  • [...] very happy about my SVLC-hypothesis, I think I knew it for a long time, but somehow only now, after I have put it on paper, I felt that [...]

