Big data

There is a lot of confusion around “big data”. People naturally assume that big data means lots of data, which is true. But it isn’t any old data or, at least, it doesn’t have to be. I bring this up because just last week I heard about a company investigating the possibility of using Hadoop to store and support the analysis of several years’ worth of transactional data.

Now, it is possible to think of reasons to use Hadoop for this purpose: it might be cheaper, or you might prefer Java programmers to SQL developers. But this is not the sort of environment for which Hadoop would naturally spring to mind. Moreover, I don’t care how large your organisation is, you won’t need huge quantities of disk for a few years of transactions. This isn’t, relatively speaking, “big” data. It’s actually pretty small data, but if you are used to storing only 3 months’ worth of data then maybe it looks big.

So we need to be clear about what we mean by big. Generally speaking we are at least talking about hundreds of terabytes and more often petabytes.
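To put rough numbers on that comparison, here is a back-of-envelope sketch. The transaction volume, record size and retention period are all assumed figures chosen for illustration, not measurements from any real organisation, and the estimate ignores indexes, replication and compression.

```python
# Back-of-envelope storage estimate. Every input below is an assumed figure.
TRANSACTIONS_PER_DAY = 10_000_000   # assumption: a very busy organisation
BYTES_PER_TRANSACTION = 1_000       # assumption: roughly 1 KB per record
YEARS_RETAINED = 5                  # assumption: "several years"

TB = 10**12  # decimal terabyte


def estimated_terabytes(per_day, bytes_each, years):
    """Return raw storage in terabytes, ignoring indexes and compression."""
    return per_day * bytes_each * 365 * years / TB


size_tb = estimated_terabytes(
    TRANSACTIONS_PER_DAY, BYTES_PER_TRANSACTION, YEARS_RETAINED
)
print(f"{size_tb:.2f} TB")  # prints "18.25 TB"
```

Even with these generous assumptions the total comes to tens of terabytes, well short of the hundreds of terabytes at which “big” starts to apply.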

The next thing to think about is what sort of data it is. Hadoop and MapReduce are particularly useful when it comes to analysing semi-structured and unstructured data, while traditional data warehouses, using traditional analytic techniques, are not. On the other hand, you can do things with structured data in a conventional data warehouse that would be much more difficult to do using MapReduce. So there is a good case for treating Hadoop and MapReduce on the one hand, and data warehousing on the other, as complementary. However, if you are going to start looking at using Hadoop for transactional data then they become competitive, which is something else entirely.

A contributing confusion is the suggestion that big data can be equated with machine-generated data. By machine-generated data I mean anything that does not originate with someone keying in something like an order somewhere. That covers anything generated from the Internet (Twitter feeds, Facebook pages, LinkedIn, clickstream data and so on) as well as data from computer-operated machine tools, environmental monitoring devices, RFID sensors, smartphones, stock market ticks (complex event processing) and so forth.

There are two things wrong with this. The first is that I think “machine” is the wrong word here: I don’t believe that anyone thinks of the Internet as a machine, or their cell phone for that matter. I prefer the term auto-generated data.

But the relationship with big data is more important. It is certainly true that all of these sources can generate lots of data, though often a lot of it gets filtered out. However, a Twitter feed is fundamentally different from, for example, a SCADA device, in that the former generates unstructured data and the latter generates structured data. Hadoop, going back to my earlier point, could be ideal for storing the Twitter-based information, and you can then use MapReduce for sentiment analysis. But traditional approaches to warehousing and analytics should be entirely suitable for analysing SCADA-derived data.
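As a rough illustration of what MapReduce-style sentiment analysis over tweets involves, here is a minimal in-process sketch. It does not use Hadoop's API; the word lists, the sample tweets and the function names `map_tweet` and `reduce_counts` are all hypothetical, and a real job would distribute the map and reduce phases across a cluster and use a far better classifier.

```python
from collections import defaultdict

# Toy sentiment lexicons: assumptions for illustration only.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "awful", "bad"}


def map_tweet(tweet):
    """Map phase: emit one (sentiment_label, 1) pair per tweet."""
    words = set(tweet.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    yield (label, 1)


def reduce_counts(pairs):
    """Reduce phase: sum the emitted counts for each label."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical sample input standing in for a Twitter feed.
tweets = [
    "I love this phone",
    "awful service today",
    "the train is late",
    "great match, good result",
]

pairs = [pair for tweet in tweets for pair in map_tweet(tweet)]
counts = reduce_counts(pairs)
print(counts)
```

The point of the sketch is that free text has to be parsed and scored before it can be aggregated, and that is the kind of work MapReduce is suited to. SCADA readings, by contrast, arrive with a fixed structure and can be queried directly in a conventional warehouse.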

The bottom line is that “big data” is not very useful as a term. It doesn’t necessarily equate to Hadoop and MapReduce, and they in turn do not necessarily map to auto-generated data either.