Big data context

In the first article in this series I wrote about the various ways in which big data needs to be considered over and above which database you are going to host it on, specifically with respect to governance and management of that data. In the second article I wrote about trust: can you rely on this data for decision-making? A second, and related, issue is context. Big data per se is meaningless: it only becomes useful to your business when it is understood within the proper context. Of course, this is true of all data, but it is easy to get carried away when it comes to big data. There have been far too many “oh, there must be some valuable information in there somewhere” claims.

So, what does data in context mean? In general terms it means understanding how this data relates to existing sources of data that are well understood. For example, how do comments on social media sites relate to brand management or CRM? This is fairly intuitive. However, as we move into the realm of the Internet of Things these relationships can become less obvious, especially to business users who may not be completely au fait with the technology. If you think loosely about smart meters, for instance, you might assume that they are just about billing and capacity planning but, in reality, there are a variety of other ways to use that information, such as detecting fraud and informing service management.

However, context considerations also go deeper than this. For instance, you will want to know where the data came from. Some web sites, for example, are self-selecting and/or have a particular political, religious or other bias. Is this data from the latest model of smart meter or from a previous generation device? Is this derived data and, if so, what is its lineage? You will also want to know what the terms mean (does “cool” in reference to your product mean a good thing or a bad thing?), how up-to-date the data is (in fast moving markets older data may be irrelevant, and the same may apply to different generations of sensors), who or what has touched the data and, if it has been changed, how it has been changed (machine generated data often needs de-duplication, which is a good thing, but you need to know that it has been done). All these contextual pieces of information will help to inform you as to how much you should trust this data (see the previous article). But, more than that, this sort of information will help you to decide what information you should use in your analyses and what should be left out.

Here’s a concrete example of what we might be talking about: suppose your data scientists come up with the suggestion that “if we put payphones into convenience stores that will reduce crime”. Wouldn’t you want to know how they reached this conclusion, what evidence it was based on, and whether they had considered that this might simply move the crime from one location to another? The law of unintended consequences can easily apply, not to mention spurious correlations such as the famous one between beer and nappies (diapers).

To put all this more technically, you need metadata about the data you are intending to analyse. There will be different sorts of metadata depending on the data you are analysing, just as there will be different sorts of data quality processes (discussed in the last article) that need to be applied, but the need is there just as it is with conventional data.
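
As a rough illustration of what such metadata might look like in practice, here is a minimal Python sketch. Everything in it is an assumption made for the sake of the example: the DatasetContext record, its field names and the deduplicate helper are hypothetical and do not come from any particular product or standard. The idea is simply that the origin, age and lineage of the data, and the changes made to it, travel with the data itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class DatasetContext:
    """Hypothetical metadata record: the context that travels with a dataset."""
    source: str                       # where the data came from
    device_generation: Optional[str]  # e.g. smart meter model, if relevant
    collected_at: datetime            # when the data was captured
    derived_from: List[str] = field(default_factory=list)  # lineage
    changes: List[str] = field(default_factory=list)       # what touched it

    def is_fresh(self, now: datetime, max_age: timedelta) -> bool:
        """True if the data is no older than max_age at time `now`."""
        return now - self.collected_at <= max_age


def deduplicate(readings: list, context: DatasetContext) -> list:
    """Drop repeated readings, keeping first occurrences in order,
    and record in the context that this has been done."""
    unique = list(dict.fromkeys(readings))
    context.changes.append(
        f"de-duplicated: {len(readings)} -> {len(unique)} rows"
    )
    return unique
```

An analyst could then check is_fresh before including a feed in a fast moving analysis, and inspect changes to confirm that de-duplication has already been applied rather than assuming it. A real deployment would more likely keep this information in a metadata catalogue than in application code.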