By David Gruzman, on November 9th, 2014

Efficient usage of local drives in the cloud for big data processing

The cloud sounds like a perfect platform for big data processing: you get as much processing power as you need when you need it, and release it when you don’t. So why does so much big data processing happen outside the cloud? Let’s try to find out:

The question came from the following dilemma in big . . . → Read More: Efficient usage of local drives in the cloud for big data processing

By David Gruzman, on June 18th, 2014

Multi-engine data processing

There is a lot of criticism of HDFS: it is slow, it has a single point of failure (SPOF), it is read-only, etc. All of the above is true. Systems built on top of a local file system are more efficient than those built on top of HDFS (like Cassandra vs. HBase). That is also true. However, . . . → Read More: Multi-engine data processing

By Camuel Gilyadov, on September 14th, 2012

Apache Drill Design Meeting

The MapR folks invited me to participate in the Apache Drill design meeting. The Meetup site indicates that 60 people participated, which sounds about right.

Tomer Shiran started the meeting with an overview of the Apache Drill project. Then I (Camuel here) presented our team’s view of the Apache Drill architecture. Jason Frantz of MapR continued touching . . . → Read More: Apache Drill Design Meeting

By Camuel Gilyadov, on October 13th, 2010

Debunking common misconceptions in SSD, particularly for analytics

1. SSD is NOT synonymous with flash memory.

First of all, let’s settle on terms. SSD is best described as the concept of using semiconductor memory as a disk. There are two common cases: DRAM-as-disk and flash-as-disk. Flash memory is a semiconductor technology pretty similar to DRAM, just with a slightly different set of trade-offs.

. . . → Read More: Debunking common misconceptions in SSD, particularly for analytics

By Camuel Gilyadov, on October 12th, 2010

Google Percolator: MapReduce Demise?

Here are my early thoughts after quickly looking into Google Percolator and skimming the paper.

Major takeaway: massive transactional mutation of a tens-of-petabytes-scale dataset on a thousands-node cluster is possible!

MapReduce is still useful for distributed sorts of big data and a few other things; nevertheless, its “karma” has suffered a blow. Beforehand you could end any MapReduce dispute by . . . → Read More: Google Percolator: MapReduce Demise?

By Camuel Gilyadov, on October 8th, 2010

CAP equivalent for analytics?

The CAP theorem deals with trade-offs in transactional systems. It doesn’t need an introduction, unless of course you have been busy on the moon for the last couple of years. In that case you can easily Google for good intros. Here is a Wikipedia entry on the subject.

I was thinking about how I would build an . . . → Read More: CAP equivalent for analytics?

By Camuel Gilyadov, on October 8th, 2010

Analytics Patterns

Unsatisfied with my previous post’s definition of Advanced Analytics, and after giving some thought to what counts as an advanced method in analytics, I realized that the analytics industry is missing a good analytics pattern catalog: a list of common problems followed by a list of common industry-consensus solutions to them. An equivalent of the GoF design patterns for analytics. The . . . → Read More: Analytics Patterns

By Camuel Gilyadov, on October 8th, 2010

Feature list of ultimate BigData analytics

Volume Scalability => the solution must handle high volumes of data, meaning the cost must scale linearly in the range of 10GB – 10PB. Latency Scalability => the solution must be interactive or batch, and cost must scale linearly in the range of 1 msec – 1 week. Sophistication Scalability => the solution . . . → Read More: Feature list of ultimate BigData analytics