Some time has passed since our initial post on Hadoop over OpenStack Swift implementation. A couple of things have changed (Rackspace finally implemented range requests in their Cloudfiles library) others remained the same (still no built-in support for Hadoop in OpenStack / CloudFiles).
We got a lot of feedback and questions regarding the integration but not always had the time or patience to properly address them, sorry for that. But one of our readers, Zheng Xu, did a great job by putting together a slide deck on the exact procedure.
But there are still some points I need to address regarding the procedure he assembled there. It mostly boils down to Cloudfiles: although current Cloudfiles implementation has HTTP range support, our implementation uses our own code for the latter, therefore I really encourage ether using our Cloudfiles distribution (with patches) or changing our Hadoop code to use the new Rackspace one. Although the simple filesystem tasks will work as expected, any MapReduce job that works with big files will fail without correct HTTP range support.
I want to thank Zheng Xu for the effort and congratulate him on the success of his small experiment.