What is Hadoop?

This is the first of I don’t know how many articles about “Big Data”. I was going to start the first article by saying that big data is not the same as Hadoop but then I realised that I had better describe what Hadoop is first, as otherwise that statement wouldn’t make sense. So you can take this as a sort of preface to the whole series.

Hadoop is a) an Apache Software Foundation development project and b) the original developer’s son’s stuffed toy elephant (hence the Hadoop logo and the menagerie of complementary projects named after animals). It has three bits to it: a storage layer, a framework for query processing and a function library. The last is not terribly important so I’ll concentrate on the first two. There are also loads of things built on top of or around Hadoop (Hive, Pig, Zookeeper et al) that are not part of Hadoop per se – I’ll come back to these in a further article.

The storage layer is HDFS (Hadoop distributed file system) and it stores data in blocks. Unlike a relational database, in which block sizes are typically 32Kb or less, in HDFS block sizes are, by default, 64Mb. This supports serial processing, as opposed to the random I/O that you find in a relational database. It is good for reading large volumes of data but useless for transaction processing or operational business intelligence.

Further, HDFS has redundancy built-in. Designed to run across a cluster of hundreds or thousands of low cost servers, HDFS expects those servers and their disk drives to break down on a regular basis. For this reason each data block is stored, by default, three times within a Hadoop environment. Note the implication of this: however much data you want to store you will require at least three times as much capacity as data (bearing in mind factors such as compression). This also has implications for processing functions, which are despatched to the servers on which the data resides rather than bringing the data to the processing, which is the norm in relational environments. A further point to note is that new data is always appended to what is already present: you cannot insert data.

OK, enough about HDFS for the time being, what about MapReduce? MapReduce is a programming framework for parallelising queries, which greatly speeds up the (batch – this is not suitable for real-time) processing of large datasets running, in conjunction with HDFS, on low-cost hardware. While it is usually thought of as having two steps (Map and Reduce) it actually has three. The first, the Map phase, reads the data and converts it into key/value pairs (a discussion of key/value databases will be the subject of a further article or articles). An example of a key/value pair might be “Name: Howard” “Employee number: 666” (just kidding) where Name is the key and Employee number is the value. Imagine that you had multiple employee files, then you would have a separate map task for each file.

Secondly, there is a process called Shuffle, which assigns the output from mapping tasks to specific Reduce tasks, which combine these results. All keys with the same value must be sent to the same reduce task.

There is obviously more to it than I have outlined here but these are the essentials. So, to return to the initial question: what is Hadoop? The short answer is that it is a support framework for MapReduce. The key word is “a”: you can run MapReduce on Aster Data (Teradata) so you don’t need Hadoop at all, you can deploy HBase, which is a column-oriented but non-relational database management system on top of HDFS, or you can deploy RainStor alongside HDFS, or you can replace HDFS with IBM’s GPFS-SNC or there are various other options. Why you might want to do any of these things are subjects for another day but the important thing to remember at this stage is that the key element is the query processing provided by MapReduce rather than anything else.