All things Hadoop

There are a couple of things that the Hadoop community is currently obsessing about. One is Spark and the other is the Open Data Platform (ODP).

Taking the latter first, the ODP is a collaboration between HortonWorks, IBM, Teradata, Pivotal, SAS and a bunch of others to create a standardised version of Hadoop (including HDFS, MapReduce, YARN and Apache Ambari). Notable by their absence from ODP are Cloudera, MapR, Oracle, SAP and Microsoft.

There are several things that are interesting about this. The first is the very existence of ODP, the second is the vehemence of the attack that Cloudera has made upon ODP and the third is whether, with the advent of Spark, it is relevant anyway.

The first two points are clearly linked. Cloudera’s view is that companies like Pivotal and IBM (and even HortonWorks) have invested heavily in Hadoop technologies but have not really gained the traction that those investments warranted. There’s some truth in this: an IBM executive asked me yesterday whether I thought the market really knows much about IBM BigInsights, and it probably doesn’t. On the other hand, is this a pre-emptive attack by Cloudera because it is running scared? There is no doubt that the vendors in the ODP have a significant amount of clout.

My view is that it doesn’t really matter. The truth is that the stuff that the ODP is concentrating on is not differentiating: HDFS is storage technology, MapReduce is a programming framework and YARN and Ambari are about management. It makes quite a lot of sense for vendors to cut costs by collaborating on a common standard at this level, but in the end it will make no difference because most people don’t buy boxes; they buy solutions. What ODP has recognised, and Cloudera apparently hasn’t, is that at this level Hadoop is a commodity and ODP is simply about the commoditisation process.

The third question is whether Spark changes this story. There have been a lot of articles about Hadoop vs Spark in the press and blogosphere lately, and they are misinformed because that’s not the right question. It is really about MapReduce vs Spark: HDFS remains the same, although Spark will also work with other data stores such as NoSQL databases. I have a couple of comments. Firstly, by all accounts Spark provides much better performance than MapReduce, both for conventional programming and SQL. Secondly, while I am not convinced that Spark is ready for prime time in the commercial marketplace yet, I have talked to a lot of vendors that are building products on top of Spark, which suggests that in time it may well displace MapReduce. In which case, yes, this does have an impact on ODP.

Finally, Spark is not always as competitive with the Hadoop landscape as you might think. To illustrate, consider Spark Streaming versus Storm. The former is significantly faster than the latter but they are not comparable products. This is because Storm is event based: it processes individual events. Spark Streaming, however, is windows (small “w”) based. That is, it batches events up and processes those batches rather than individual events, which at least partly explains why it is so much faster than Storm. Now, there are applications where that is fine but there are also applications where it isn’t: if you need to react to each event as it arrives, waiting for a batch to fill adds delay.
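To make the distinction concrete, here is a minimal sketch in plain Python. It uses neither Storm nor Spark: the function names, the `Event` type and the sample data are all made up for illustration, and the windowing is a simple fixed interval rather than anything either product actually implements. It only shows why batching means fewer handler invocations, and why an event may sit waiting until its window closes.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical event shape for this sketch: (timestamp in seconds, payload).
Event = Tuple[float, str]


def process_per_event(events: Iterable[Event],
                      handler: Callable[[Event], None]) -> int:
    """Event-at-a-time style: invoke the handler once for every event."""
    calls = 0
    for event in events:
        handler(event)
        calls += 1
    return calls


def process_micro_batches(events: Iterable[Event],
                          interval: float,
                          handler: Callable[[List[Event]], None]) -> int:
    """Window style: group events into fixed time windows of `interval`
    seconds and invoke the handler once per non-empty window.

    Assumes events arrive sorted by timestamp.
    """
    calls = 0
    batch: List[Event] = []
    current_window = None
    for event in events:
        window = int(event[0] // interval)  # index of the window this event falls in
        if current_window is not None and window != current_window:
            handler(batch)  # the previous window has closed, so flush it
            calls += 1
            batch = []
        current_window = window
        batch.append(event)
    if batch:  # flush whatever is left in the final window
        handler(batch)
        calls += 1
    return calls


# Illustrative data only: five events spread over three one-second windows.
sample_events: List[Event] = [
    (0.1, "a"), (0.4, "b"), (0.9, "c"), (1.2, "d"), (2.7, "e"),
]
```

With one-second windows, the per-event version calls its handler five times on the sample data while the batched version calls it three times. The flip side is that event "a" at 0.1s is not handled until the first event of the next window turns up, which is exactly the trade-off that matters for applications needing an immediate response.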

The point is that you need to be careful about what you are comparing. ODP is about stuff where any comparisons are pretty much a waste of time; Spark isn’t a competitor to Hadoop; and it isn’t always a competitor to what you might think it competes with.