Content Copyright © 2011 Bloor. All Rights Reserved.
Hadoop sounds great but it has a number of issues associated with it. The first is high availability. In particular, Hadoop has a single NameNode, which is where the metadata about the Hadoop cluster is stored. Because there is only one of them, the NameNode is a single point of failure for the entire environment. If you don’t mind that then fine, but otherwise you will either need a much more expensive and robust server to house the NameNode or you will need to take an alternative approach, of which there are several. One is to go with a different distribution of Hadoop such as MapR, which fixes the NameNode problem. Another is to turn to a company such as ZettaSet, which has built additional tooling around Hadoop, including NameNode high availability, but which does not fork the Apache distribution. Or, since the NameNode issue is specific to HDFS (Hadoop Distributed File System), you could replace HDFS with IBM’s GPFS-SNC, which similarly averts the problem. GPFS is also POSIX (Portable Operating System Interface) compliant, which HDFS is not.
Another associated problem is with the JobTracker. This manages MapReduce jobs and assigns their tasks to relevant servers (those close to where the data is stored). Unfortunately, the JobTracker, too, usually runs on only a single node, so it also represents a single point of failure. Fortunately, the same approaches that fix the NameNode issue will generally also handle JobTracker failures.
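To make the idea of assigning tasks "close to where the data is stored" concrete, here is a minimal sketch in Python. It is not Hadoop's actual scheduler: the function name assign_tasks, the node names and the slot counts are all hypothetical, and the greedy rule (prefer a node that holds a replica of the block, otherwise fall back to any node with a free slot) is an assumption made purely for illustration.

```python
def assign_tasks(block_locations, free_slots):
    """Toy locality-aware assignment (illustrative only, not Hadoop's code).

    block_locations: dict mapping block id -> list of nodes holding a replica
    free_slots:      dict mapping node -> number of free task slots
    Returns a dict mapping block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignments = {}
    for block, replicas in block_locations.items():
        # First choice: a node that already stores the block and has a free slot.
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node = max(local, key=lambda n: slots[n])
            is_local = True
        else:
            # Fallback: any node with a free slot; the data must cross the network.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                raise RuntimeError("no free task slots in the cluster")
            node = max(remote, key=lambda n: slots[n])
            is_local = False
        slots[node] -= 1
        assignments[block] = (node, is_local)
    return assignments


if __name__ == "__main__":
    # Hypothetical three-node cluster with three data blocks.
    blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n3"]}
    slots = {"n1": 2, "n2": 1, "n3": 1}
    for block, (node, is_local) in assign_tasks(blocks, slots).items():
        print(block, node, "local" if is_local else "remote")
```

Note that in this sketch all of the bookkeeping lives in a single process, which mirrors the point made above: if the one node doing the assigning goes down, nothing else knows which tasks were running where.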
In addition to this menagerie, some traditional vendors are exploiting MapReduce directly within their own products. For example, Syncsort, in the newly announced DMExpress 7.0, supports MapReduce functions directly from its GUI. In other words you can define a data integration task directly with DMExpress, using traditional drag-and-drop methods, and it will take care of the exploitation of MapReduce for you. This is great for data integration but unfortunately it doesn’t help with query processing.
The third issue is that Hadoop and a number of associated products perform poorly. Again, various vendors have stepped into the breach. Thus MapR has re-written Shuffle so that it is 30% faster, while ZettaSet has made Pig multi-threaded. MapR has also made numerous other improvements and it estimates that these will halve your hardware requirements. Then again, Pervasive DataRush supports Hadoop clusters and it can be used either in conjunction with HDFS as an alternative to MapReduce or in conjunction with Hadoop, in either case providing significantly improved performance. Pervasive also has a product called TurboRush for Hive, which the company claims improves the performance of Hive queries while needing only half the hardware. In internal benchmarks the product out-performed native Hive (which you don’t have to change) by a factor of three.
Finally, Hadoop is not an easy environment to manage. That is not surprising, really, when you consider that you might have hundreds of servers in a cluster. Both alternative distributions (MapR and so forth) and build-around products (ZettaSet, BigInsights et al) aim to help here, and there is also the ZooKeeper project from Apache, which provides synchronisation, configuration management and other cross-cluster services.
The bottom line is that there are a lot of considerations around Hadoop. It is by no means a mature environment and it is likely that you will require multiple additional products to make it work properly, especially if you go down the open source route. If you are happy to go commercial then you will probably need fewer such add-ons but then, of course, you will have to pay for them.