Big data storage options

Classically, when you think about Hadoop you think about HDFS and MapReduce: together they define Hadoop. But there is a growing number of extensions and replacements, which means that you need to qualify what you are talking about when you simply mention the yellow elephant.

To begin with there are Apache projects such as HBase that build on top of Hadoop. These are fine in the sense that both HDFS and MapReduce remain in place and therefore such extensions are genuinely Hadoop-based.

However, what are we to make of companies like MarkLogic? If you google “MarkLogic” you will find an entry for “MarkLogic & Hadoop: better together”. What the company actually means by this is that it has an implementation of MarkLogic, still in its labs, that is based on HDFS. And, yes, the “H” in HDFS stands for Hadoop, but HDFS on its own is not actually Hadoop per se.

Next we have DataStax. Its website suggests that you “get DataStax Enterprise and experience a better Hadoop”. Now, DataStax supports Hadoop functionality, but in practice it replaces HDFS with Cassandra, although this happens under the covers where the developer can’t see it. Is this really Hadoop if it doesn’t have HDFS as its storage engine?

Not that DataStax is alone: IBM has its GPFS replacement for HDFS, or you can replace it with RainStor or, shortly, you will be able to use InterSystems’ Globals database (the database part of Caché) as a replacement for HDFS. No doubt there are other options also.
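To make the idea of swapping storage “under the covers” concrete, here is a minimal sketch in Python. It is purely illustrative: StorageEngine, FileStyleStore, RowStyleStore and word_count are hypothetical names invented for this example, and none of them is the API of Hadoop, Cassandra or any vendor mentioned above. The point is only that a job written against a storage contract cannot tell which engine sits behind it.

```python
from abc import ABC, abstractmethod
from collections import Counter


class StorageEngine(ABC):
    """Hypothetical storage contract that a job is written against."""

    @abstractmethod
    def write(self, path: str, data: str) -> None: ...

    @abstractmethod
    def read(self, path: str) -> str: ...


class FileStyleStore(StorageEngine):
    """Stand-in for a file-system-style engine: one blob per path."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files[path]


class RowStyleStore(StorageEngine):
    """Stand-in for a row-oriented engine: each line is kept as a row."""

    def __init__(self):
        self._rows = {}

    def write(self, path, data):
        self._rows[path] = data.split("\n")

    def read(self, path):
        return "\n".join(self._rows[path])


def word_count(store: StorageEngine, path: str) -> Counter:
    """The job sees only the contract, never the engine behind it."""
    return Counter(store.read(path).split())


text = "yellow elephant\nyellow storage"
results = []
for engine in (FileStyleStore(), RowStyleStore()):
    engine.write("/demo/input.txt", text)
    results.append(word_count(engine, "/demo/input.txt"))
```

Both engines give the job the same answer, which is exactly why the developer need not notice the substitution, and also why the question of whether the result is still “Hadoop” arises at all.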

So there are four questions here. Firstly, is it really Hadoop if you have replaced part or all of it? Secondly, if that is the case then isn’t Hadoop just a buzzword rather than a technology? Thirdly, do we actually care? Finally, and perhaps most importantly, if HDFS in particular is constantly being replaced by other storage engines then which is actually the best engine? The answer to this last question isn’t clear-cut: for example, you would choose RainStor if you wanted extreme compression capabilities but might choose Globals or GPFS for entirely different reasons. So it seems likely that we will be having a major debate in the coming months and years about where each of the various storage options is most appropriate.

Nor is this debate confined to Hadoop. Tokutek (the MySQL storage engine provider) has just published a blog post about deploying its fractal tree indexing against MongoDB. Like the MarkLogic implementation, this is still in development, but it is another pointer to the future: while a lot of the attention around NoSQL databases has gone to extending their YesSQL aspects, there is a growing focus on the storage engines themselves.