ImpalaToGo announcement

During my work at BigDataCraft.com I kept seeing the same problem our customers face: how to get efficient SQL on big data in the cloud.

Let's look at a typical case.

First case: daily logs of some kind arrive and are stored in S3. A few dozen reports need to be run over the new data each day; daily data volume ranges from tens to hundreds of gigabytes. There is also a need for ad-hoc reports on older data.

What are our options?

  • Run Hive over S3. It works, but slowly: transfer from S3 takes a big share of execution time, Hive itself is slow, and data is transferred between S3 and EC2 instances for every query. We can do a bit better by scripting the replication of recent data into local HDFS.

  • Run an efficient engine, Cloudera Impala. The problem is that it does not work with S3 directly. To go this way we would need to script moving data into local HDFS and cleaning it up when it outgrows the local space. In other words, we would need to treat local HDFS as a cache, manually (see the sketch after this list).

  • Use Redshift. It is a good option, but it also requires managing data movement from S3. It is also proprietary and locks us into AWS, and it is expensive unless we commit to multi-year reservations.
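
To make the "manual cache" option concrete, here is a rough sketch of the kind of glue code it implies, written in Python. The bucket name, HDFS path, and eviction threshold are hypothetical, and a production script would need far more error handling:

    import subprocess
    import boto3

    BUCKET = "my-logs-bucket"          # hypothetical bucket holding the daily logs
    LOCAL_HDFS_DIR = "/cache/logs"     # HDFS directory used as a manual cache
    MAX_CACHE_BYTES = 500 * 10**9      # evict once the cache outgrows ~500 GB

    s3 = boto3.client("s3")

    def pull_new_objects(prefix):
        """Copy the S3 objects under a given prefix (e.g. one day of logs) into HDFS."""
        for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix).get("Contents", []):
            local = "/tmp/" + obj["Key"].replace("/", "_")
            s3.download_file(BUCKET, obj["Key"], local)
            subprocess.run(["hdfs", "dfs", "-put", "-f", local, LOCAL_HDFS_DIR], check=True)

    def evict_oldest():
        """Naive eviction: delete the oldest cached file once the cache is too big."""
        used = int(subprocess.check_output(
            ["hdfs", "dfs", "-du", "-s", LOCAL_HDFS_DIR]).split()[0])
        if used > MAX_CACHE_BYTES:
            # skip the "Found N items" header, sort by modification date and time
            listing = subprocess.check_output(
                ["hdfs", "dfs", "-ls", LOCAL_HDFS_DIR]).decode().splitlines()[1:]
            oldest = sorted(listing, key=lambda l: l.split()[5:7])[0].split()[-1]
            subprocess.run(["hdfs", "dfs", "-rm", oldest], check=True)

ImpalaToGo, described below, is meant to make exactly this kind of glue code unnecessary.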

The second case is an extension of the first one. We have data accumulated in S3, and our data science team wants to slice and dice it in search of something interesting. What should they do? Data scientists are usually not IT-savvy enough to build and manage their own Hadoop cluster; often they just want to run SQL on a subset of the data at interactive speed.

So today they have the following options:

  • Ask management to build a Hadoop cluster with Impala or Spark, and ask DevOps to script the data transfer, etc.

  • Ask management for an expensive MPP database and keep only a modest amount of data inside it, to avoid hefty per-terabyte licensing.

  • Master the use of EMR.

To summarize: it is very easy to store data in S3, but it is not easy to get efficient SQL over it.

We decided to make life a bit easier for people who want to analyze data living in S3.

How did we do it?

Conceptually, we see the solution in organizing fast local drives and remote S3 into a single storage layer. This blog post is devoted to the idea: http://bigdatacraft.com/archives/470

To field-test this concept we built ImpalaToGo, a modification of Cloudera Impala that works directly with S3 while using local drives as a cache.

In practice this means the following advantages for the cases above:

  • We can define external tables over S3 data, in the same way we are used to doing it in Hive (see the sketch after this list).

  • When we repeat a query on the same data, it runs 3-10 times faster, since the data is read from local drives, which are much faster than S3.

  • When the local drives fill up, the least recently used data is deleted automatically, so you do not have to manage it yourself.
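
As an illustration of the first point, here is a minimal sketch of defining an external table over S3 data and querying it from Python through the impyla client (impala.dbapi). The coordinator host, bucket, and table schema are made up, and the s3:// location is accepted on the assumption that ImpalaToGo handles it as described above:

    from impala.dbapi import connect   # impyla package; any HiveServer2-style client would do

    conn = connect(host="impalatogo-coordinator", port=21050)  # hypothetical coordinator host
    cur = conn.cursor()

    # Plain Hive-style external table DDL pointing at data that stays in S3;
    # the engine pulls blocks onto local drives on first access and reuses them afterwards.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS daily_logs (
            ts      TIMESTAMP,
            user_id BIGINT,
            url     STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
        LOCATION 's3://my-logs-bucket/daily/'
    """)

    cur.execute("SELECT url, COUNT(*) AS hits FROM daily_logs GROUP BY url ORDER BY hits DESC LIMIT 10")
    for row in cur.fetchall():
        print(row)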

If you want to try it, we have a pre-built AMI and documentation on how to organize instances into a cluster. It is completely free and open source (https://github.com/ImpalaToGo/ImpalaToGo/wiki/How-to-try).

If you have any questions – please write to david@bigdatacraft.com

 

Efficient usage of local drives in the cloud for big data processing

The cloud sounds like a perfect platform for big data processing: you get as much processing power as you need and release it when you don't. So why does so much big data processing happen outside the cloud? Let's try to find out.

The question comes down to the following dilemma in cloud big data processing.
You can store data in S3 and process it in EC2. This is elastic and economical per GB, and you can resize your cluster as you wish, but you are limited by S3 bandwidth. EMR running against S3 is a popular example of this approach.

Or you can build HDFS or another distributed storage system on top of local (ephemeral) drives. The tradeoff is clear: bandwidth is good, but storage is expensive and elasticity suffers, because you cannot remove nodes when their processing power is not needed. Redshift, or Hadoop with HDFS on local drives, are typical examples of this approach.

Both solutions have drawbacks, so let's take a closer look.
Cloud storage like S3 is built to store a lot of data cheaply, at roughly $30 per TB per month, and reliably, backed by an SLA with a lot of nines.
Local drives need to be fast, which today means SSD. This technology provides very good performance, but the price per GB is high.

HDD space costs 20-25 times less than SSD. In the Amazon cloud the gap between local drive space and S3 space is even wider: 1 TB of storage on c3.8xlarge instances costs around $2K per month, roughly sixty times more than storing the same data in S3.

What about throughput? Access to local data is around 5 times faster than access to data on S3. Moreover, bandwidth to S3 can be throttled by Amazon depending on the current load and other factors, whereas access to local drives is consistently fast.

There is a possible counter-argument: we do not need all this storage bandwidth. If we process data at a speed matching the storage bandwidth, more bandwidth buys us nothing. S3 can give us 50-100 MB/sec for a big instance like c3.8xlarge; if we process data with MapReduce or Hive, that is close to the processing speed, assuming MR processes about 5 MB/sec per core.
For more efficient engines, like Redshift or Impala, the speed is about 100 MB/sec per core or more.
So we do need faster access. To support this point, note that Redshift nodes have 2.4 GB/sec of disk IO, and I trust that AWS engineers know what they are doing.
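
A rough back-of-the-envelope calculation makes the mismatch concrete; the 32 vCPUs of a c3.8xlarge and the per-core rates quoted above are the only inputs:

    cores = 32          # vCPUs on a c3.8xlarge instance
    mr_rate = 5         # MB/sec per core for MapReduce/Hive-style processing
    fast_rate = 100     # MB/sec per core for Impala/Redshift-style engines
    s3_bandwidth = 100  # MB/sec, an optimistic S3 figure for one big instance

    print(cores * mr_rate)    # 160 MB/sec needed: roughly in S3's ballpark
    print(cores * fast_rate)  # 3200 MB/sec needed: S3 falls short by more than an order of magnitude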

Now let's recall that big data is usually a huge pile of cheap data; by cheap I mean low value per GB. If the data were expensive (like financial transactions), it could happily live in Oracle plus enterprise storage.

So how do we use our resources more efficiently? On one hand we have a lot of slow, inexpensive storage; on the other, a little fast, expensive storage. The obvious solution is to use the fast storage as a cache. This pattern is common: DRAM holds the disk cache, and SRAM inside the CPU is used as a cache for DRAM.

In this situation I suggest using local SSD drives as a cache for cloud storage (S3).
So what does that mean? An effective cache should meet the following requirements:
Store the hot data set. This is the main duty of the cache. The usual heuristic is LRU (least recently used): we assume that data used recently has a good chance of being used again.

Prefetch: predict what data will be needed and load it ahead of time. For a disk cache this is read-ahead (if we read the first block of a file, there is a good chance we will need the next one). For a CPU there are very advanced prefetch algorithms. For a typical data warehouse we can assume that recently added data is more likely to be of interest than old data.
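
To make the LRU part concrete, here is a minimal sketch of such a cache in Python. The block identifiers, the fetch callback, and counting entries instead of bytes are all simplifications:

    from collections import OrderedDict

    class LRUBlockCache:
        """Keep recently used blocks on fast storage, evicting the least recently used."""

        def __init__(self, capacity, fetch_from_s3):
            self.capacity = capacity            # max number of cached blocks (bytes in a real system)
            self.fetch_from_s3 = fetch_from_s3  # callable that reads a block from slow cloud storage
            self.blocks = OrderedDict()         # block_id -> data, ordered by recency of use

        def read(self, block_id):
            if block_id in self.blocks:
                self.blocks.move_to_end(block_id)   # cache hit: mark as most recently used
                return self.blocks[block_id]
            data = self.fetch_from_s3(block_id)     # cache miss: go to S3
            self.blocks[block_id] = data
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)     # evict the least recently used block
            return data

A real data cache would also run the prefetch logic described above in the background, pulling in recently added data before anyone asks for it.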

To summarize: I believe that to process big data efficiently in the cloud, we need to use local drives as a cache for the dataset stored in cloud storage. I also believe other approaches will not be efficient as long as cloud storage remains much slower and cheaper than local drives.

Multi-engine data processing

There is a lot of criticism of HDFS: it is slow, it has a SPOF, files are effectively read-only once written, etc. All of the above is true. Systems built on top of a local file system are more efficient than those built on top of HDFS (Cassandra vs. HBase, for example). That is also true.
However, HDFS has brought us a new opportunity. We have a platform to share data among different database engines. Let me elaborate.
Today there are a number of database engines built on top of HDFS, such as Hive, Pig, Impala, Presto, etc. Many Hadoop users also develop their own partially generic MapReduce jobs, which could be called specialized database engines.
Before we had such a common platform, we had to select a single database engine for our analytics and ETL needs. It was a tough choice, and a frustrating one: the selected engine owned the data by storing it in its own format. We were forced to write custom data operations not in the language of our choice but in the database's internal language, like PL/SQL or Transact-SQL. When our processing became less relational, we wrote a lot of logic in that internal language and had to pay a database licence just to run our own code.
The first spring bird of change was Hive's external tables. At first glance it was a minor feature: instead of letting Hive choose the location and format for the data, we got to do that job ourselves. Conceptually it was a big change: we began to own our data. Now it can be produced by our own MR jobs and processed in any way we like outside of Hive, and at the same time we can use Hive to query it.
Then came Impala, which, instead of defining its own metadata, reuses Hive's and thus inherits those wonderful external tables. So we have two engines capable of processing the same data, for the first time as far as I remember.
Why is this so important? Because we are no longer restricted by a given engine's problems and bugs. Let's say Impala is perfect at running 5 out of our 6 queries but fails on the last one. What do we do? Right: we run 5 queries with Impala and the last one with Hive, which is slower but a much less demanding creature than Impala (a sketch of this fallback pattern follows below).
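
A minimal sketch of that fallback pattern, assuming both engines are reachable through their usual Python clients (impyla for Impala, PyHive for Hive); the hosts, ports, and blanket exception handling are placeholders:

    from impala.dbapi import connect as impala_connect   # impyla package
    from pyhive import hive                               # PyHive package

    def run_query(sql):
        """Try the fast engine first; fall back to the slower but more forgiving one."""
        try:
            cur = impala_connect(host="impala-host", port=21050).cursor()
            cur.execute(sql)
            return cur.fetchall()
        except Exception:
            # Impala could not handle this query; Hive reads the same external tables,
            # so we can retry without moving or converting any data.
            cur = hive.connect(host="hive-host", port=10000).cursor()
            cur.execute(sql)
            return cur.fetchall()
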
Then came Spark with its real-time capabilities, Presto from Facebook, and, I am sure, many more to come. How do we select the right one?
So, in my opinion, we should stop selecting and start combining. We should shift to a single-data, multiple-engines paradigm. Thanks to the gods of free open source software, this does not involve licensing costs.
What makes it possible is common storage, open data formats and shared metadata: HDFS as the storage; CSV, Avro, Parquet, ORC, etc. as the formats; and the Hive metastore, with HCatalog as its successor, as the shared metadata.
We do not have to stage a revolution, but I think we should prefer engines that run on top of common storage (like HDFS), data formats that are understood by many engines, and database engines that do their best at data processing while leaving us the freedom to choose data placement and formats.

Network virtualization for the Cloud: Open vSwitch study

In the current reality of ten-thousand-node data centers and all the BigData jazz, it seems the network folks were slightly forgotten. We have plenty of hardware virtualization solutions, but until now the network was left on the outskirts of the cloud hype. Let's see what we can use right now and whether it will get better in the future.

When people talk about network virtualization nowadays, one name immediately springs to mind: Nicira. They invented OpenFlow and Open vSwitch (OVS) and… were acquired by VMware.

Why Nicira? They essentially designed the current state of network virtualization. OpenFlow is implemented in physical hardware and OVS is used by a lot of people to drive the software network stack in virtualized environments.

But is it any good? Let's see. If you open the OpenFlow specification, it looks simple: cut hardware intervention off at the Ethernet level and implement all other features in software. We essentially write a program (a handler) that matches some fields in a packet and acts according to simple rules: forward to a port, drop, or pass to another handler. But then, how do you install these handlers inside the switch? The solution is also not that complicated: you write another, more complex piece of software that runs on something generic (like a PC). It chooses handlers for particular flows by issuing commands to the switch; when the switch encounters something it has no handler for, it passes the packet to this PC (the controller), and the controller either installs a new handler in the switch or processes the packet itself.
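
To illustrate the match-action model itself (not the real OpenFlow wire protocol or any OVS API), here is a toy flow table in Python; the field names and actions are simplified placeholders:

    # Each flow entry is (match_fields, action); a packet matches if all listed fields agree.
    flow_table = [
        ({"in_port": 1, "eth_type": 0x0800}, ("output", 2)),   # IPv4 from port 1 -> forward to port 2
        ({"in_port": 2},                     ("drop", None)),  # everything from port 2 -> drop
    ]

    def handle_packet(packet):
        """Return the action for a packet, or punt it to the controller on a table miss."""
        for match, action in flow_table:
            if all(packet.get(field) == value for field, value in match.items()):
                return action
        # Table miss: in OpenFlow the packet goes to the controller, which can then
        # install a new flow entry so the switch handles similar traffic by itself.
        return ("to_controller", None)

    print(handle_packet({"in_port": 1, "eth_type": 0x0800, "ip_dst": "10.0.0.5"}))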

OpenFlow switch

 

What do we see here? There is an execution platform inside the hardware for running the network handlers, and a controller that chooses the handler for each flow. It looks very promising and flexible, and it can be implemented not only in hardware but also purely in software. And the same folks implemented it in OVS. Shall we peek inside? Yes, I answered myself, and downloaded the OVS source.

When I looked inside the code I was a little… how should I put it… surprised? The OVS code has everything but the kitchen sink: a serializable hash table, a JSON parser, VLAN tagging, several QoS implementations, an STP implementation, a Unix socket server, an RPC-over-SSL server and, the icing on the cake, their own database with a key/value plus columnar storage engine. And everything implemented from scratch (or so it seems).

OK, enough shock value; how does this thing work? It turns out that the operation is not that different from the hardware I described above: there is a kernel module instead of actual hardware, and the flow handlers are just functions inside that module. It looks like this.

Open vSwitch

The daemon uses netlink to talk to the kernel module and install the handlers, the database stores the configuration, and the controller talks to the daemon via OpenFlow or plain JSON.

So we have a software network stack; why is it good for virtualization?

Short answer: because everybody uses Linux, and when your hypervisor runs on Linux, why not use some of its capabilities for a nice network boost? But why OpenFlow/OVS?

The OVS docs describe it like this:

Open vSwitch is targeted at multi-server virtualization deployments, a landscape for which the previous stack is not well suited. These environments are often characterized by highly dynamic end-points, the maintenance of logical abstractions, and (sometimes) integration with or offloading to special purpose switching hardware.

But Linux has always had a good network stack, easily manageable and extensible. So what are the advantages? There are some.

  • OVS can migrate network state together with a bridge port (connected to a VM instance, for example). You will lose some packets, but connections may stay intact.
  • OVS can use metadata to tag frames (VM instance UUID, for example).
  • OVS can install event triggers inside the daemon to notify the controller of state changes (a VM reboot, for example).

Does it justify a new kernel module + new DB engine + new RPC protocols? Maybe not. But you can’t get these capabilities from any other open source software anyway.

The only question I have right now is why Nicira did not implement STT in OVS, although it certainly did in its proprietary NVP software.

Apache Drill Design Meeting

The MapR folks invited me to participate in the Apache Drill design meeting. The meetup site indicates that 60 people participated, which sounds about right.

Tomer Shiran started the meeting with an overview of the Apache Drill project. Then I (Camuel here) presented our team's view of the Apache Drill architecture. Jason Frantz of MapR continued with technical aspects in the follow-on discussion. After a pizza break, Julian Hyde presented his view on logical/physical query plan separation and suggested using the Optiq framework for the DrQL optimizer.

Overall, my takeaways are as follows:

  1. There is a very healthy interest in interactive querying over BigData.
  2. There was not a single voice calling for adapting vanilla Hadoop to this task.
  3. There is a general consensus on supporting a plurality of query languages and a plurality of data formats.
  4. There is a general consensus that the user should always be free to supply a manually authored physical query plan for execution, bypassing the optimizer altogether, as opposed to heavy-handed hinting.
  5. No one except me tried to challenge the “common logical query model” concept. Since Dremel has no real joins, no indexes, and only one data source with exactly one access path, a single full table scan, I cannot see the justification for the complexity of optimizers and a logical query model. Dremel is an antidote to all of this.

Thank you, MapR, for the Drill initiative, the great design meeting and the invitation.

Hadoop on OpenStack Swift: experiments

Some time has passed since our initial post on our Hadoop over OpenStack Swift implementation. A couple of things have changed (Rackspace finally implemented range requests in their Cloudfiles library), while others remain the same (still no built-in support for Hadoop in OpenStack / Cloudfiles).

We got a lot of feedback and questions about the integration but did not always have the time or patience to address them properly; sorry for that. But one of our readers, Zheng Xu, did a great job of putting together a slide deck on the exact procedure.

There are still some points I need to address regarding the procedure he assembled. It mostly boils down to Cloudfiles: although the current Cloudfiles implementation has HTTP range support, our implementation uses our own code for that, so I really encourage either using our Cloudfiles distribution (with patches) or changing our Hadoop code to use the new Rackspace one. Simple filesystem tasks will work as expected either way, but any MapReduce job that works with big files will fail without correct HTTP range support.
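
For context, HTTP range support simply means the client can ask the object store for a byte slice of an object instead of the whole thing, which is what lets a Hadoop split start reading mid-file. A minimal sketch with the Python requests library, against a hypothetical Swift object URL and auth token:

    import requests

    # Hypothetical Swift object URL and token; a real deployment would obtain these
    # from the Swift/Keystone authentication endpoint.
    url = "https://swift.example.com/v1/AUTH_account/container/big-file.dat"
    headers = {
        "X-Auth-Token": "AUTH_tk_example",
        "Range": "bytes=67108864-134217727",   # read only the second 64 MB split
    }

    resp = requests.get(url, headers=headers)
    assert resp.status_code == 206, "206 Partial Content means the range was honored"
    split_data = resp.content                   # just the requested byte range, not the whole object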

I want to thank Zheng Xu for the effort and congratulate him on the success of his small experiment.

Apache Drill Progress

We are continuing our efforts to contribute our OpenDremel code to the Apache Drill project and look forward to being active in it right after that.

Right now the effort is going into our ANTLR-based parser: we want to make it work with the new grammar of the BigQuery language. That should be done within a few days, and the parser will be committed to the new Drill repository as the first phase of the OpenDremel-Drill merge.

Next, we plan to refactor and contribute the Semantic Analyzer, which processes the parser output into an intermediate form, resolving references and rewriting (flattening) the query into a single full-table-scan operation. That is expected within a week or two, depending on when the Drill architecture doc is published. We still don't know what the schema language/format will be. Will it be Protobuf? Avro? OpenDremel supports Avro right now and has initial support for Protobuf.

The final phase of the OpenDremel-Drill merge will be the contribution of the code generator based on Apache Velocity templates. We have two sets of templates for now: one is Java-based and executed with the Janino executor, and the other uses C/asm and is executed with the ZeroVM executor.

Everyone who wishes to help is welcome. The OpenDremel code resides in its usual Google Code repo, http://code.google.com/p/dremel/. BE SURE TO LOCATE AND USE THE REPO COMBO BOX at the upper part of the page.

We will probably use the https://github.com/ApacheDrill repo as a staging area, or the Apache git repo directly; it all depends on what Ted Dunning, the Apache Drill champion, proposes.

We also continue work on our generic execution backend built on top of OpenStack Swift and integrated with ZeroVM. We are contributing to both projects here.

We look forward to an Apache Drill with pluggable frontends and pluggable backends, so it would be able to run on top of a toy single-JVM Janino backend, under YARN management on HDFS with a Janino or ZeroVM backend, or even on a Zwift backend (our codename for the OpenStack Swift + ZeroVM combo).

The frontends will be pluggable too, so, in the future, support for new languages such as Apache Pig or Apache Hive can be added easily. Another option would be to create a single frontend with pluggable language handlers, which would allow us to embed functionality from other projects such as Apache Mahout or R.

Apache Drill

We are no longer alone in implementing Google Dremel and BigQuery technology. A proposal for a similar project was recently submitted to the Apache Foundation. Moreover, Ted Dunning kindly invited us to take part in it.

The project is just starting: there is no source code yet, and not even a consensus design. So we sat down this evening and wrote a proposed design for Apache Drill. We have already been working for about two years on a Dremel and BigQuery implementation. It has been a fascinating journey, we have learned quite a lot, and we would be more than happy to share our experience and accumulated knowledge.

All our code (OpenDremel/Dazo/ZeroVM) has carried the Apache License from the beginning and uses several Apache technologies, from Avro to Velocity. Apache seems to be the best home for the Drill project, and we look forward to contributing to it.

Start-Up Chile

I am frequently asked about my experience in the Start-Up Chile program. Having participated for the past half year, I can say it has been an interesting and fulfilling experience.

On top of the seed capital, you get a supporting framework of mentors and fellow founders. You can literally “feel” the surrounding entrepreneurial spirit. And although I was unlucky in finding peer support for my infrastructure BigData@Cloud idea (most folks were doing consumer web startups), I did find the framework highly encouraging.

The capital is equity-free, which is especially nice and makes negotiating the next financing round easier. Getting the money is a paperwork-intensive process, but the staff are friendly and helpful.

I found Chileans hospitable and friendly to foreigners, although minimal Spanish seems to be mandatory. I found myself speaking Spanish after a few months in Santiago, which was not planned initially.

Santiago is a nice, modern, mountain-surrounded city, and pretty safe I would say. I cannot count how many times locals warned me about how unsafe Santiago really is, but apart from the near-permanent strikes and riots in the city center I never experienced, witnessed, or heard about any incident, and I usually work deep into the night and walk extensively before retiring to bed. I lived in Centro but especially enjoyed walking in the northwestern part of town. The underground is quite an efficient way to get around, though a little hot at mid-day in February, as I remember. I was mostly consumed by my startup, so I did not have enough time to tour the rest of the country, and I know even Santiago only from walks guided by the GPS in my Nokia. I really should rent a car one weekend and get out for a couple of days… In fact I did spend one weekend in Viña del Mar / Valparaíso and found it quite a nice and relaxing place.

The local entrepreneurship and geek community is also thriving, and that is not even counting the very visible Start-Up Chile folks. Go to meetup.com, choose your favorite topic or technology, and I bet you will find a packed Santiago interest group there.