Bringing governance to big data

Content Copyright © 2015 Bloor. All Rights Reserved.

It should be completely obvious that just because data is “big” doesn’t mean that it needs any less governance or security than any other sort of data. However, all too many companies seem to have the same sort of blind spot over data quality for big data as they used to have over conventional data: pretty much an “it’ll be alright on the night” mentality, coupled with the assumption that hand coding will do if needed.

In part, perhaps, this attitude has been fostered by the lack of tools designed specifically for big data environments such as Hadoop. However, this position is changing. Earlier this year Trillium announced Trillium Big Data and now, within days of one another, both IBM and Informatica have made important announcements with respect to big data. As these are the big beasts in this area, not just for governance but also for data integration, their announcements are highly significant and the most likely to sway the market.

In the case of IBM, it has announced IBM BigInsights BigIntegration and IBM BigInsights BigQuality (which seems like too many “bigs”) while Informatica has come out with Informatica Big Data Management, which encompasses both data integration and data governance.

Of course, both vendors will claim advantages over the other and I have not looked at either in sufficient detail to come to any firm conclusions. From an IBM perspective the main features of these releases (apart from their very existence) are that a) the solutions run on Apache Spark, b) Optim data masking has been integrated into Information Server so that you can mask as part of your transformation and loading processes, c) you can combine this with the InfoSphere Information Governance Catalog for things like discovering the lineage of masked data, and d) InfoSphere Data Replication provides real-time replication into Hadoop.
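To make point b) concrete: masking as part of the transformation and loading process simply means the sensitive values are obscured in flight, before they ever reach the target. The sketch below is purely illustrative plain Python, not the Optim or Information Server API; the function names and the deterministic hash-based masking scheme are my own assumptions.

```python
import hashlib

def mask_value(value: str, keep_last: int = 4) -> str:
    """Deterministically mask a sensitive value, keeping the last few
    characters visible for reconciliation (hypothetical scheme)."""
    digest = hashlib.sha256(value.encode()).hexdigest()[:8]
    return f"{digest}***{value[-keep_last:]}"

def transform_and_load(rows, sensitive_fields, load):
    """Apply masking inline, as one step of the transform-and-load flow."""
    for row in rows:
        masked = {
            field: mask_value(val) if field in sensitive_fields else val
            for field, val in row.items()
        }
        load(masked)

# Usage: mask card numbers on the way into the target store.
target = []
transform_and_load(
    rows=[{"name": "Ann", "card": "4111111111111111"}],
    sensitive_fields={"card"},
    load=target.append,
)
```

Because the masking is deterministic, the same input always yields the same token, which is what makes downstream activities such as lineage tracking (point c) possible over masked data.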

If you compare this with Informatica’s release, Big Data Management runs on the company’s proprietary Blaze engine, on top of YARN (as IBM’s products also do). Which approach is faster, I don’t know; both companies would probably claim those particular laurels and, in any case, performance is a moveable feast. As far as points b) and c) are concerned, these appear to give similar capabilities to Informatica’s Secure@Source, albeit in a rather different way and without the risk-based aspect of Secure@Source.

More interesting, however, are the features of Informatica Big Data Management for which, as far as I know, there is no IBM equivalent. The first of these is that Informatica has introduced an optimiser into its Blaze engine. If you think about it, transformations in data integration processes often do things similar to SQL processing: you will frequently, for example, want to use join operations. In which case, having an optimiser, along with query plans and so forth, makes a lot of sense. I also like the way that Informatica handles dynamic schemas and, in particular, I like the Live Data Map within the Universal Metadata Catalog. Regular readers will know that I am a fan of graph databases and graph-based visualisation, and the Live Data Map provides exactly this, making it much easier to explore your metadata.
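A toy sketch of why an optimiser matters for join-heavy transformations: pushing a filter below a join produces the same answer while building a much smaller intermediate result. This is plain Python illustrating the general predicate-pushdown idea, not Blaze’s actual planner, and the data is made up.

```python
# Hash join over lists of dicts: index the right side, probe with the left.
def join(left, right, key):
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

orders = [{"cust_id": i, "amount": i * 10} for i in range(1000)]
customers = [{"cust_id": i, "region": "EU" if i % 2 else "US"}
             for i in range(1000)]

# Naive plan: join everything, then filter.
naive = [r for r in join(orders, customers, "cust_id") if r["region"] == "EU"]

# Optimised plan: push the predicate below the join, halving the build side.
eu_customers = [c for c in customers if c["region"] == "EU"]
optimised = join(orders, eu_customers, "cust_id")

assert naive == optimised  # same result, far less intermediate work
```

An optimiser with query plans does this kind of rewrite automatically, which is exactly why borrowing SQL-style optimisation for data integration transformations makes sense.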

As I said about performance, all of these things are subject to change: vendors leapfrog one another all the time. The really significant fact about these releases is their very existence. The vendors (and I, and no doubt other analysts) have been talking about the importance of governance for big data in particular for, literally, years. Now Informatica and IBM (and Trillium) have put their money where their mouths are. It’s time that users sat up and listened.