Big data issues

This is the first in a series of articles I am planning to write about the management and governance of big data. That is, I am concerned with how you ensure that big data (whatever you mean by that) is fit for purpose and usable by the business. Conversely, I am not concerned with whether this data is stored in Hadoop or MongoDB or in your data warehouse. Nor am I really bothered about what sort of data it is: machine-generated data, social media data, video, audio or even transactional data. The type matters only in that different types of data may require different emphases as far as governance is concerned.

To clarify: machine-generated data often contains a lot of duplicated information that you would like to remove, and there may be missing data, because a sensor has failed or a network connection has broken, that you will want to identify. However, the data itself is pretty reliable and does not typically include sensitive information, so you do not generally need capabilities like data masking or data cleansing. Social media data, by contrast, may well hold sensitive data, and we are all aware (I hope) of how much data cleansing may be needed for transactional data.
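To make the machine-generated case concrete, here is a minimal sketch of removing duplicated readings and flagging gaps where a sensor or connection may have dropped out. The record layout (sensor id, timestamp, value), the function name and the idea of a fixed expected reporting interval are all assumptions made for illustration; real feeds will differ.

```python
from datetime import datetime, timedelta


def dedupe_and_find_gaps(readings, expected_interval):
    """Remove exact duplicate readings and report suspicious gaps.

    readings: iterable of (sensor_id, timestamp, value) tuples (assumed layout).
    expected_interval: timedelta between readings (assumed to be fixed).
    Returns (unique_readings, gaps), where each gap is
    (sensor_id, last_seen_timestamp, next_seen_timestamp).
    """
    # dict.fromkeys drops exact duplicates and keeps first-seen order.
    unique = list(dict.fromkeys(readings))

    # Collect the timestamps reported by each sensor.
    by_sensor = {}
    for sensor_id, timestamp, _value in unique:
        by_sensor.setdefault(sensor_id, []).append(timestamp)

    # A gap is any pair of consecutive readings further apart than expected.
    gaps = []
    for sensor_id, stamps in by_sensor.items():
        stamps.sort()
        for previous, current in zip(stamps, stamps[1:]):
            if current - previous > expected_interval:
                gaps.append((sensor_id, previous, current))
    return unique, gaps
```

Called with a list of readings and, say, `timedelta(minutes=15)`, this returns the de-duplicated readings together with the periods in which a sensor went quiet. What you then do about those periods (estimate, re-request, or simply report them) is a governance decision rather than a coding one.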

So, the focus for management and governance may differ for different types of big data, but the fundamentals are the same. And what are those fundamentals?

I think there are three.

Firstly, you need to be able to integrate your big data systems with other relevant data that resides in your environment. To take what might seem a simple example, if you have smart meters then the data that is captured needs to be integrated into your invoicing environment; it will be analysed for capacity planning purposes; you will want to use it in conjunction with fraud detection systems; you will need to link it to service and repair management systems; and so on. It will be very rare for big data to exist in splendid isolation. Moreover, different approaches to integration will be needed in different situations and these may change over time. In other words, the integration environment needs to be very flexible.

Secondly, the data needs to be trustworthy. There are two aspects to this: first, you need to know that it is secure, especially with respect to personally identifiable information and data privacy; and second, you need to know that the data is sufficiently reliable that you can trust it as the basis for your decision making.

Finally, data needs to be understood in context. For example, social media data needs to be understood within the context of CRM or brand management environments. Of course, this isn’t much different from any other data that is used for analytics but that is precisely the point: big data needs to be managed and governed just as much as ordinary data does, albeit with some qualifications.

Anyway, those are the issues: I will explore each of them further in the articles in this series.