By David Gruzman, on December 10th, 2014

ImpalaToGo announcement

During my work at BigDataCraft.com I kept seeing the same problem among our customers: how to get efficient SQL on big data in the cloud.

Let's look at a typical case.

First case: daily logs of some kind arrive and are stored in S3. There is a need to do . . . → Read More: ImpalaToGo announcement

By David Gruzman, on November 9th, 2014

Efficient usage of local drives in the cloud for big data processing

The cloud sounds like a perfect platform for big data processing: you get as much processing power as you need when you need it, and release it when you don’t. So why does a lot of big data processing happen outside the cloud? Let’s try to find out:

The question came from the following dilemma in big . . . → Read More: Efficient usage of local drives in the cloud for big data processing

By David Gruzman, on June 18th, 2014

Multi-engine data processing

There is a lot of criticism of HDFS – it is slow, it has SPOF, it is read only, etc. All of the above is true. Systems built on top of a local file system are more efficient than those built on top of HDFS (like Cassandra vs. HBase). That is also true. However, . . . → Read More: Multi-engine data processing

By Constantine Peresypkin, on October 2nd, 2012

Network virtualization for the Cloud: Open vSwitch study

In the face of the current reality of ten-thousand-node data centers and all the BigData jazz, it seems like the network guys were slightly forgotten. We have enough hardware virtualization solutions, but until now the network was left on the outskirts of the cloud hype. Let’s see what we can use right now and . . . → Read More: Network virtualization for the Cloud: Open vSwitch study

By Camuel Gilyadov, on September 24th, 2012

What does BigData mean?

The full deck is available at Continue reading What does BigData mean?

By Camuel Gilyadov, on September 14th, 2012

Apache Drill Design Meeting

The MapR folks invited me to participate in the Apache Drill design meeting. The Meetup site indicates that 60 people participated, which sounds about right.

Tomer Shiran started the meeting with an overview of the Apache Drill project. Then I (Camuel here) presented our team's view of the Apache Drill architecture. Jason Frantz of MapR continued touching . . . → Read More: Apache Drill Design Meeting

By Constantine Peresypkin, on September 8th, 2012

Hadoop on OpenStack Swift: experiments

Some time has passed since our initial post on the Hadoop over OpenStack Swift implementation. A couple of things have changed (Rackspace finally implemented range requests in their Cloudfiles library), while others remain the same (still no built-in support for Hadoop in OpenStack / CloudFiles).

We got a lot of feedback and questions regarding the integration . . . → Read More: Hadoop on OpenStack Swift: experiments

By Camuel Gilyadov, on September 4th, 2012

Progress on Apache Drill

We are continuing our efforts to contribute our OpenDremel code to the Apache Drill project and look forward to being active in it right after that.

Right now the effort is going into our ANTLR-based parser: we want to make it work with the new grammar of the BigQuery language. That should be done within . . . → Read More: Apache Drill Progress

By Camuel Gilyadov, on August 25th, 2012

Apache Drill Proposed Design

We are no longer alone in implementing Google Dremel and BigQuery technology. A proposal was recently made to the Apache Foundation suggesting a similar project. Moreover, Ted Dunning kindly invited us to take part in the project.

The project is just starting: there is no source code yet, and not even a consensus design. So . . . → Read More: Apache Drill

By Camuel Gilyadov, on July 7th, 2012

Start-Up Chile

I’ve been frequently asked about my experience in the Start-Up Chile program. Having participated in the program for the past half year, I can say that it has been an interesting and fulfilling experience.

On top of the provided seed capital you get a supporting framework of mentors and fellow startupists. You can literally “feel” the surrounding entrepreneurial spirit. And . . . → Read More: Start-Up Chile

By Camuel Gilyadov, on March 1st, 2012

Apache Hadoop over OpenStack Swift

This is a post by Constantine Peresypkin and David Gruzman. Lately we have been working on integrating Hadoop with OpenStack Swift. Hadoop doesn’t need an introduction, and neither does OpenStack. Swift is an object-storage system and the technology behind RackSpace cloud-files (and quite a few others, such as Korea Telecom object storage, Internap, etc.). Before we go . . . → Read More: Apache Hadoop over OpenStack Swift

By Camuel Gilyadov, on February 11th, 2012

Futility of "tooling" a proprietary cloud.

I’ve been pitched by a lot of entrepreneurs trying to make better-than-original “tooling” for a proprietary cloud, particularly for AWS. Ain’t the attempt futile from the beginning? Amazon is smart, innovative and working hard to make its cloud offering comprehensive, and it has a much larger arsenal to outdo anyone who dares to compete on its own turf. . . . → Read More: Futility of “tooling” a proprietary cloud.

By Camuel Gilyadov, on February 7th, 2012

OpenDremel update and Dremel vs. Tenzing

I didn’t blog for the whole of 2011… I’m not dead; quite the contrary, we were pretty active with the OpenDremel project in 2011. First, we are renaming it to Dazo to avoid using a trademarked name, and second, we did a good job implementing a secure generic execution engine and integrating it into OpenStack Swift. It also . . . → Read More: OpenDremel update and Dremel vs. Tenzing

By Camuel Gilyadov, on January 17th, 2011

Upcoming hardware renaissance era: part #2.

Some examples of upcoming hardware renaissance era:

1. Virtually all server vendors are pitching modularized data centers by now. MDCs are boxes resembling shipping containers that house a complete virtualized data center. With an MDC one just connects power, network and chilled water and gets access to the cloud in a box. Most MDCs are good to . . . → Read More: Upcoming hardware renaissance era: part #2.

By Camuel Gilyadov, on November 17th, 2010

Emerging Proprietary Hardware Renaissance

INTRO I cannot count the number of times I have heard that cloud computing means innovation stagnation in the proprietary hardware business, and that with cloud computing hardware doesn’t matter anymore and will sooner or later succumb into a boring, razor-thin-margin, oligopolistic commodity industry.

GAME OVER FOR FAT MARGINS IN PROPRIETARY HARDWARE? Why do folks think like that? . . . → Read More: Emerging Proprietary Hardware Renaissance

By Camuel Gilyadov, on October 17th, 2010

Two Envelopes Problem: Am I just dumb?

It seems the recent craze about statistician being the profession of choice of the future is gaining steam: a future in which we will be surrounded by quality BigData, capable computers and bug-free open source software, including OpenDremel. Well, the last one I made up… but the rest seems to be the current situation. Acknowledging this . . . → Read More: Two Envelopes Problem: Am I just dumb?

By Camuel Gilyadov, on October 13th, 2010

Debunking common misconceptions in SSD, particularly for analytics

1. SSD is NOT synonymous with flash memory.

First of all, let’s settle on terms. SSD is best described as the concept of using semiconductor memory as a disk. There are two common cases: DRAM-as-disk and flash-as-disk. Flash memory is a semiconductor technology pretty similar to DRAM, just with a slightly different set of trade-offs.

. . . → Read More: Debunking common misconceptions in SSD, particularly for analytics

By Camuel Gilyadov, on October 12th, 2010

Google Percolator: MapReduce Demise?

Here are my early thoughts after quickly looking into Google Percolator and skimming the paper.

Major take-away: massive transactional mutation of a tens-of-petabytes-scale dataset on a thousands-node cluster is possible!

MapReduce is still useful for distributed sorts of big data and a few other things; nevertheless, its “karma” has suffered a blow. Beforehand you could end any MapReduce dispute by . . . → Read More: Google Percolator: MapReduce Demise?

By Camuel Gilyadov, on October 11th, 2010

How scalable is the Linux kernel on a 48-core machine?

According to this excellent and comprehensive research, with some kernel hacking a ~33x speedup (compared to a single core) is possible. For example, PostgreSQL running on 48 cores gives ~4x out of the box, and after kernel/PostgreSQL patches are applied it grows to ~33x. Assuming IO can keep up, of course.
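To put those two numbers in perspective, here is a rough back-of-the-envelope sketch. It takes only the figures quoted above (48 cores, roughly 4x and roughly 33x) and computes parallel efficiency. It also computes the serial fraction those speedups would imply if Amdahl's law held, which is my assumption for illustration and not something the research is claimed to state. The function names are made up for this example.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / cores


def amdahl_serial_fraction(speedup: float, cores: int) -> float:
    """Serial fraction s implied by Amdahl's law, S = 1 / (s + (1 - s) / N).

    Solving for s gives s = (N / S - 1) / (N - 1).
    This is an illustrative assumption; real kernel bottlenecks
    (lock contention, cache-line bouncing) need not follow Amdahl's model.
    """
    return (cores / speedup - 1) / (cores - 1)


if __name__ == "__main__":
    CORES = 48  # core count quoted in the post
    # Approximate speedups quoted in the post for PostgreSQL.
    for label, speedup in [("out of the box", 4.0), ("patched", 33.0)]:
        eff = parallel_efficiency(speedup, CORES)
        serial = amdahl_serial_fraction(speedup, CORES)
        print(f"{label}: efficiency {eff:.1%}, implied serial fraction {serial:.2%}")
```

Under that assumption, the out-of-the-box run uses under a tenth of the machine, while the patched run gets to roughly two thirds of ideal scaling, with an implied serial fraction of about one percent.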

By Camuel Gilyadov, on October 11th, 2010

Is NoSQL a DBMS?

Yes, it is.

Proof? – By definition.

But Wikipedia…… – fixed.

By Camuel Gilyadov, on October 8th, 2010

CAP equivalent for analytics?

The CAP theorem deals with trade-offs in transactional systems. It doesn’t need an introduction, unless of course you have been busy on the moon for the last couple of years. In that case you can easily Google for good intros. Here is a Wikipedia entry on the subject.

I was thinking about how I would build an . . . → Read More: CAP equivalent for analytics?

By Camuel Gilyadov, on October 8th, 2010

Analytics Patterns

Unsatisfied with my previous post’s definition of Advanced Analytics, and giving some thought to what advanced methods in analytics are, I realized that the analytics industry is missing a good analytics pattern catalog: a list of common problems followed by a list of common industry-consensus solutions to them. An equivalent of the GoF design patterns for analytics. The . . . → Read More: Analytics Patterns

By Camuel Gilyadov, on October 8th, 2010

Feature list of ultimate BigData analytics

Volume Scalability => the solution must handle high volumes of data, meaning the cost must scale linearly in the range of 10GB – 10PB.

Latency Scalability => the solution must be interactive or batch, and cost must scale linearly in the range of 1 msec – 1 week.

Sophistication Scalability => the solution . . . → Read More: Feature list of ultimate BigData analytics

By Camuel Gilyadov, on October 7th, 2010

Terminology: Analysis vs. analytics advanced analytics

I see a lot of confusion in the usage of newer terms in analytics. I do confuse them myself occasionally. I find it funny that an industry as serious as analytics tolerates constant renewal of its basic terminology. Yet, I confess, I’m very guilty of it myself. I do enjoy the freshness and the novelty . . . → Read More: Terminology: Analysis vs. analytics and more…

By Camuel Gilyadov, on October 1st, 2010

The story behind this blog

Continue reading The story behind this blog