Governance and Hadoop - Informatica has just released HParser

Content Copyright © 2011 Bloor. All Rights Reserved.

Before I start I should say that this is not just about Hadoop but also about all its extensions, distributions, alternatives or replacements. For brevity’s sake we need some terminology here so I will refer to these generically as BDDB (big data database) solutions.

As someone said to me recently, “it’s only when you start to look at Hadoop that you start to realise what’s important about the MS in DBMS”. However, it’s not management that I want to discuss and, actually, while a major focus of the BDDB vendors is to provide the sort of manageability you might expect, there is some way to go before those capabilities are fully robust.

What I want to discuss is governance, and there are a number of issues. Firstly, the advocates of BDDBs suggest that you might well want to include conventional structured data within your BDDB as well as unstructured data, because relevant queries require both types of data. Fair enough. But how do you ensure the quality of the structured data? I don’t know of any data quality tools (or profiling tools, for that matter) that run against BDDBs. That means that either a) you don’t care about data quality, b) you cleanse the data at source (a good idea, but many companies don’t do it), or c) you load the data into your conventional warehouse first, cleanse it and then export it to the BDDB. In any case, governance goes beyond traditional data quality. It is also about validating your business rules so that the logic for transformations is sound from a business perspective.
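To make the point concrete, the sort of check involved might look like the following minimal sketch. The field names and rules here are illustrative assumptions of mine, not features of any particular product: records failing a quality or business rule are rejected before loading.

```python
# Hypothetical sketch: validating structured records against simple
# data-quality and business rules before they are loaded into a BDDB.
# Field names ("customer_id", "amount", "discount") and the rules
# themselves are illustrative assumptions.

def validate(record):
    """Return a list of rule violations for one record (empty = clean)."""
    errors = []
    # Data-quality check: a key field must be populated.
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    amount = record.get("amount")
    if amount is None or amount < 0:
        errors.append("amount must be non-negative")
    # Business-rule check: a discount should never exceed the sale amount.
    if record.get("discount", 0) > (amount or 0):
        errors.append("discount exceeds amount")
    return errors

records = [
    {"customer_id": "C1", "amount": 100.0, "discount": 10.0},
    {"customer_id": "",   "amount": 50.0,  "discount": 0.0},
    {"customer_id": "C3", "amount": 20.0,  "discount": 25.0},
]

clean = [r for r in records if not validate(r)]
rejected = [(r, validate(r)) for r in records if validate(r)]
print(len(clean), len(rejected))  # 1 clean record, 2 rejected
```

Whether such checks run at source, in the warehouse, or in the BDDB itself is exactly the governance decision discussed above.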

Secondly, there are quality issues around the unstructured data that you may load into your BDDB. Consider stock ticks or call detail records: both of these are frequently duplicated and those duplicates need to be filtered out. Similarly, lots of RFID events have no value, typically when nothing changes, and you need a filtering mechanism here too. (IBM has such a solution but, strangely, it is in the WebSphere portfolio rather than InfoSphere, where all the big data stuff is – go figure.) Further, this issue is not limited to sensor and event-based information. Consider tweets: I write a tweet (not often), someone forwards it, someone else comments on it, it goes viral (not likely): how many tweets is that? If the original tweet gets copied 5,000 times, does that count as one tweet or 5,001? I guess it depends on what you want to do with the information, but there is certainly a governance issue.
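The two filters described above, dropping duplicate records and suppressing no-change events, can be sketched as follows. This is a toy illustration under assumed record shapes, not any vendor's implementation:

```python
# Hypothetical sketch of two event-stream filters: deduplicating call
# detail records and suppressing RFID reads where nothing has changed.
# The record fields are illustrative assumptions.

def dedupe(records, key_fields):
    """Keep the first occurrence of each record key; drop later duplicates."""
    seen = set()
    out = []
    for r in records:
        key = tuple(r[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def changes_only(events, value_field):
    """Emit an event only when its value differs from the previous value
    seen for the same tag (no-change reads carry no information)."""
    last = {}
    out = []
    for e in events:
        tag, value = e["tag"], e[value_field]
        if last.get(tag) != value:
            last[tag] = value
            out.append(e)
    return out

cdrs = [
    {"caller": "555-0100", "callee": "555-0199", "start": "2011-11-01T10:00"},
    {"caller": "555-0100", "callee": "555-0199", "start": "2011-11-01T10:00"},  # duplicate
]
rfid = [
    {"tag": "pallet-1", "location": "dock-A"},
    {"tag": "pallet-1", "location": "dock-A"},  # nothing changed: filtered out
    {"tag": "pallet-1", "location": "dock-B"},
]

print(len(dedupe(cdrs, ["caller", "callee", "start"])))  # 1
print(len(changes_only(rfid, "location")))               # 2
```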

Thirdly, there is the question of how you get meaningful information out of unstructured data. This is particularly important if you want to combine this information with structured data. Here, at least, there is now a product. Informatica has just released HParser, which is a parser optimised for Hadoop. That is, it runs on Hadoop (with all of its parallelism and so forth) and is used to parse unstructured and semi-structured data in much the same way that a data quality tool might parse a product description.

HParser runs on nearly any distribution of Apache Hadoop and Informatica distributes it via its marketplace. Informatica has also just announced a partnership with Hortonworks, and you can also see the link to Informatica’s marketplace from the Hortonworks data platform. The nicest thing about it is that it doesn’t require the analyst or developer to know anything about MapReduce: he or she simply works within a graphical user interface and the software takes care of the implementation. The parser will parse web logs, call detail records, financial information from various providers (Bloomberg et al) and, more generally, XML and JSON (JavaScript Object Notation). The software will discover relationships and hierarchies within the data and flatten them.
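To give a feel for what flattening a hierarchy means in practice, here is a minimal sketch of the general idea: a nested JSON document is turned into flat, dotted-path key/value pairs that structured tools can consume. This illustrates the concept only; it is not Informatica’s implementation, and the sample document is invented.

```python
# Hypothetical sketch of hierarchy flattening: nested JSON is reduced to
# flat key/value pairs with dotted-path keys. The sample order document
# is an illustrative assumption.
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts and lists into dotted-path keys."""
    flat = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            flat.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            flat.update(flatten(v, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat

doc = json.loads("""
{
  "order": {
    "id": 42,
    "lines": [
      {"sku": "A-1", "qty": 2},
      {"sku": "B-7", "qty": 1}
    ]
  }
}
""")

flat = flatten(doc)
print(flat["order.id"])           # 42
print(flat["order.lines.0.sku"])  # A-1
```

Once hierarchical data is in this flat form, it can be joined with conventional structured data, which is precisely the combination discussed above.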

I dare say that HParser will not be the last product to start to tackle governance within the BDDB space but it is the first.