By David Gruzman, on June 18th, 2014

Multi-engine data processing

There is a lot of criticism of HDFS: it is slow, it has a single point of failure in the NameNode, its files are write-once, and so on. All of the above is true. Systems built on top of a local file system are more efficient than those built on top of HDFS (compare Cassandra vs. HBase). That is also true.
However, HDFS has also brought us a new opportunity: a common platform for sharing data among different database engines. Let me elaborate.
Today there are a number of database engines built on top of HDFS, such as Hive, Pig, Impala, and Presto. Many Hadoop users also develop their own partially generic MapReduce jobs, which could be called specialized database engines in their own right.
Before we had such a common platform, we had to select a single database engine for our analytics and ETL needs. It was a tough choice, and a binding one: the selected engine owned the data by storing it in its own internal format. We were forced to write custom data operations not in the language of our choice, but in the database's internal language, like PL/SQL or Transact-SQL. When our processing became less relational, we ended up writing a lot of logic in that internal language, and we had to pay the database license just to run our own code.
The first harbinger of change was Hive's external tables. At first glance it was a minor feature: instead of letting Hive choose the place and the format for the data, we got the chance to do that job ourselves. Conceptually, it was a big change: we began to own our data. Now it can be produced by our own MapReduce jobs and processed in any way we like outside of Hive, and at the same time we can use Hive to query it.
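As a minimal sketch in HiveQL (the table name, columns, and HDFS path here are invented for illustration), an external table just registers a schema over files we already own:

    -- The files live where *we* put them; Hive only records the schema
    -- and the location. Dropping the table removes the metadata, not the data.
    CREATE EXTERNAL TABLE page_views (
      view_time  TIMESTAMP,
      user_id    BIGINT,
      url        STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/page_views';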
Then came Impala. Instead of defining its own metadata, it reuses Hive's metastore and thus inherits those wonderful external tables. So we have two engines capable of processing the very same data; as far as I remember, for the first time.
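To make that concrete, a hedged sketch continuing the hypothetical page_views table above: after creating it in Hive, we can query it from impala-shell without copying or converting anything:

    -- Run in impala-shell. INVALIDATE METADATA makes Impala pick up
    -- a table created in the Hive metastore after Impala started.
    INVALIDATE METADATA page_views;

    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;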
Why is this so important? Because we are no longer held hostage by a given engine's problems and bugs. Let's say Impala is perfect at running five out of our six queries, but fails at the last one. What do we do? Right: we run five queries using Impala and the last one using Hive, which is slower, but is a much less demanding creature than Impala.
Then came Spark with its near-real-time, in-memory capabilities, then Presto from Facebook, and, I am sure, many more are to come. How do we select the right one?
So, in my opinion, we should stop selecting and start combining. We should shift to a "single data, multiple engines" paradigm. Thanks to the gods of free open-source software, this does not involve licensing costs.
Three things make this possible: common storage, open data formats, and shared metadata. The storage is HDFS; the data formats are CSV, Avro, Parquet, ORC, and so on; and the shared metadata lives in the Hive metastore, which HCatalog exposes to the rest of the ecosystem.
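Putting the three together, one more hedged sketch: the same hypothetical table declared over an open columnar format. Any engine that reads Parquet and talks to the shared metastore, whether Hive, Impala, or Spark SQL, can process these files in place:

    -- Same data-ownership idea, different open format. The Parquet files
    -- under this location are readable by any Parquet-aware engine.
    CREATE EXTERNAL TABLE page_views_parquet (
      view_time  TIMESTAMP,
      user_id    BIGINT,
      url        STRING
    )
    STORED AS PARQUET
    LOCATION '/data/page_views_parquet';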
We do not need a revolution. But I think we should prefer engines that run on top of common storage (like HDFS), data formats that many engines understand, and database engines that concentrate on doing data processing well while leaving us the freedom to choose data placement and formats.
