What is big data?

Industry hype aside, it’s easier to say what big data is not. To begin with, it is not Hadoop (see the preface to this series: What is Hadoop?). Nor is it simply having lots of data. And, in particular, it has nothing to do with having lots of transactional data.

Let’s think about data growth for a moment. The first thing to note is that petabyte-scale storage issues are not new: CERN had a distributed Objectivity-based database holding a petabyte back in the 90s, long before the Large Hadron Collider was much more than a gleam in the eye of most physicists. All that’s happened is that the commercial world is catching up with the scientific community. And, of course, it’s all relative: what’s big to me may be chickenfeed to you.

Where is growth coming from? Well, we can assume that all sorts of data are growing. If you are a successful company then you would expect transactions to be growing, so your traditional data warehouse is growing. But it’s growing incrementally, not by orders of magnitude. In a survey we conducted last year, respondents reckoned, on average, that data warehouse capacities had doubled in the previous five years. They estimated the same growth rate in the next five years.
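To put that survey figure in perspective, here is a rough back-of-the-envelope calculation. It assumes steady compound growth, which the survey did not claim, and the function names are my own for illustration only.

```python
import math

def annual_growth_rate(factor: float, years: float) -> float:
    """Compound annual growth rate implied by growing `factor`-fold over `years`."""
    return factor ** (1.0 / years) - 1.0

def years_to_grow(factor: float, annual_rate: float) -> float:
    """Years needed to grow `factor`-fold at a steady compound `annual_rate`."""
    return math.log(factor) / math.log(1.0 + annual_rate)

# Doubling in five years, as the survey respondents reported.
rate = annual_growth_rate(2.0, 5.0)

# How long one order of magnitude (tenfold growth) would take at that rate.
years_for_10x = years_to_grow(10.0, rate)

print(f"{rate:.1%} per year; {years_for_10x:.1f} years to grow tenfold")
# prints: 14.9% per year; 16.6 years to grow tenfold
```

On those assumptions, doubling every five years is a little under 15% a year, and a single order of magnitude would take more than 16 years: incremental growth, in other words.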

The same argument also applies to content. Certainly content management requirements are expanding, but not especially fast.

No, the two major sources of growth are instrumentation on the one hand and external data on the other. These actually overlap.

Instrumented data means anything that you are monitoring. This could be based on sensors, RFID, smart meters, monitoring of web sites, log data from databases and network devices, call detail and IP detail records, SCADA devices and so on. This sort of data is sometimes called machine-generated or interactional data. Now, log data has, for years, been collected and stored using log management, database activity monitoring and SIEM (security information and event management) systems, while the same is true of web site monitoring (clickstream analysis). However, what has changed is that we are now instrumenting more and more things and/or we have realised the potential of the instrumented data that we previously threw away.

As an example of this last point, a typical oil rig has something like 40,000 sensors on it but most of that data is neither collected nor analysed. To take another example, blogs and social media have been around for years, even if Twitter is more recent, and organisations are now realising that sentiment analysis may serve a useful purpose.

So, the issue is twofold: there is more of this instrumented data and we have the ability to exploit more of what we already have. This last part is important: while there was plenty of this data around in the past, we lacked the ability to process it easily and inexpensively. This is where Hadoop comes in or, more specifically, MapReduce, because it enables the investigation of this sort of information relatively cheaply. This is crucial. The truth is that most companies deploying big data solutions don’t know how much of all this data is actually useful, but the point is that you can take these very large datasets and look for the combinations of data that really are useful without it costing you too much. The process is akin to data mining: looking for relationships in the data that you know may exist but not knowing what and where they are.
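For readers who have not met MapReduce, the shape of the idea can be sketched in a few lines. This is a single-process illustration in plain Python, not Hadoop’s API; the log lines, their "device status" layout and the function names are all hypothetical.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical log lines; the "device status" layout is assumed for illustration.
LOG_LINES = [
    "router-1 OK",
    "router-2 ERROR",
    "router-1 ERROR",
    "router-2 ERROR",
    "switch-9 OK",
]

def map_phase(line):
    """Emit a (device, 1) pair for each error line and nothing otherwise."""
    device, status = line.split()
    if status == "ERROR":
        yield device, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would do between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all the values for one key into a single count."""
    return key, sum(values)

mapped = chain.from_iterable(map_phase(line) for line in LOG_LINES)
error_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(error_counts)
```

Because each line is mapped independently and each key is reduced independently, the same two functions can be spread across many cheap machines, which is where the low cost of exploration comes from.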

There is one other point to bear in mind: for some applications, and for some types of instrumented data, putting the data into Hadoop or some other data store (including a traditional data warehouse) and then analysing it may be too slow. Where you have very large amounts of data combined with low latency requirements, you may need to use complex event processing rather than database technology for analysis purposes. However, this is getting beyond the point: we are here discussing what big data is, not how you implement it, which I’ll come back to in another article.

The bottom line is that big data is not really about the data or about database management systems but is about how you query ALL relevant data, regardless of whether it is generated internally or externally, and whether you do so before or after it is stored. If IBM had not already purloined the term BigInsights for its Hadoop-based product I might be inclined to suggest using that term as being more meaningful than “big data” but I guess we are stuck with the latter.