Challenging Cloudera

The market around Hadoop has historically been very gentlemanly. When an open source product becomes popular you usually end up with multiple competing distributions: look at Linux, for example. Until recently that has not been the case with Hadoop, with both IBM and Yahoo! announcing that they would stick with the standard distribution.

However, a few weeks ago EMC broke ranks and announced Greenplum HD Enterprise Edition. This is based on MapR Technologies’ Hadoop distribution, which is also now available from MapR itself.

The question is: why did EMC do it? The answer is simply that MapR is so much better than the standard Hadoop distribution, especially for enterprise-class deployment.

There are several big problems with the standard version of Hadoop. The first is that it is not very resilient. It is not just that there is a single point of failure; in fact there are multiple single points of failure. For example, JobTracker runs on a single node. JobTracker is the Hadoop service that farms out MapReduce tasks to specific nodes within the Hadoop environment, so if that node fails then so does JobTracker, and all your analytics fail with it.

Similarly, there is only a single NameNode in a standard Hadoop implementation. This is the service that keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. There is an optional SecondaryNameNode that can be hosted on a separate machine, but it only creates checkpoints and does not provide any real redundancy. A BackupNameNode is planned as part of a move to support high availability, but it does not exist right now.
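To put a rough number on why multiple single points of failure matter, here is a minimal sketch in Python. It assumes that the JobTracker node and the NameNode fail independently and that the cluster is only usable when both are up; the 99.9% availability figures are hypothetical, chosen purely for illustration, and do not come from any vendor.

```python
def serial_availability(*component_availabilities: float) -> float:
    """Availability of a system that needs every listed component to be up.

    Assumes the components fail independently of one another.
    """
    result = 1.0
    for availability in component_availabilities:
        result *= availability
    return result


def hours_down_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24


# Hypothetical figures: each single node is up 99.9% of the time.
job_tracker_node = 0.999
name_node = 0.999

cluster = serial_availability(job_tracker_node, name_node)
print(f"cluster availability: {cluster:.6f}")
print(f"expected downtime: {hours_down_per_year(cluster):.1f} hours per year")
```

Under these assumptions every additional single point of failure multiplies the availability down further, which is why removing them matters more than hardening any one of them.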

MapR’s distribution fixes both of these issues, supporting what it calls JobTracker HA and NameNode HA so that there is no single point of failure within a Hadoop cluster. The company also offers the capability to mirror data across clusters, using asynchronous replication, to support failover and disaster recovery for the whole cluster.

A further problem with NameNode is that it limits the number of files you can support to somewhere between 70 and 100 million (depending on the server). This may sound like a lot but actually isn’t. Workarounds are available, but MapR has removed this limit and will support a trillion files.
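The reason the limit depends on the server is that the standard NameNode holds the file system namespace in memory. As a back-of-the-envelope sketch, the Python below uses the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (a file or a block). That figure, and the assumption of one block per file, are mine rather than MapR's, and real usage will vary.

```python
# Rule-of-thumb assumption: about 150 bytes of NameNode heap per object.
BYTES_PER_OBJECT = 150


def namenode_heap_gb(num_files: int, blocks_per_file: int = 1) -> float:
    """Rough heap (in GB) needed to hold metadata for num_files files.

    Each file counts as one object, plus one object per block.
    """
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1e9


for files in (70_000_000, 100_000_000, 1_000_000_000_000):
    print(f"{files:>17,} files -> about {namenode_heap_gb(files):,.0f} GB of heap")
```

On these assumptions 70 to 100 million files already need tens of gigabytes of heap on a single machine, and a trillion files would be far beyond what any one server's memory could hold.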

The second and third big issues with Hadoop are that both compression and performance are poor. To give a couple of examples without going into all the details: MapR has rewritten the Shuffle phase (an intermediate stage after Map and before Reduce) so that it performs more than three times faster, and it has replaced garbage-collected Java code with C. The company reckons that its distribution will run between 2 and 5 times faster (3 times on average) and will require half the hardware resources.

The fourth big issue with Hadoop is that Hive, which provides SQL access to Hadoop, is very slow and lacks important functionality. However, that’s got nothing to do with MapR and is a subject for another day.

Up until now the received wisdom has been that Cloudera is the preferred distribution of Hadoop for enterprise-class environments. With all of the features just mentioned, plus other significant features that I don’t have space to discuss in detail (such as the graphical management console for Hadoop clusters), I think that position has changed. It seems to me that MapR is significantly ahead of the rest of the market in terms of its capabilities and is likely to stay that way for a considerable period.