Managing data catalogue proliferation

Data catalogues are becoming increasingly popular. Put simply, a data catalogue is a repository of information about a company’s data assets: what data is held, what format it is in, within which (business) domains that data is relevant, and where it is (in which databases and/or files). The information within a data catalogue may be classified further by geography, time, access control (who can see the data) and so on. Data catalogues are indexed and searchable and support self-service.
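To make the description above concrete, here is a minimal sketch of what a catalogue entry and a self-service search might look like. It is purely illustrative and not modelled on any vendor's product: the class names, field names and sample values are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """One data asset as a catalogue might describe it (hypothetical fields)."""
    name: str                  # what data is held
    data_format: str           # what format it is in
    domains: list[str]         # business domains where the data is relevant
    location: str              # which database or file holds it
    geography: str = ""        # optional further classification
    allowed_roles: set[str] = field(default_factory=set)  # who can see the data


class DataCatalogue:
    """A tiny in-memory catalogue that is searchable by domain."""

    def __init__(self) -> None:
        self._entries: list[CatalogueEntry] = []

    def register(self, entry: CatalogueEntry) -> None:
        self._entries.append(entry)

    def search(self, domain: str, role: str) -> list[str]:
        # Return only the assets in this domain that the role is allowed to see.
        return [
            e.name
            for e in self._entries
            if domain in e.domains and role in e.allowed_roles
        ]


catalogue = DataCatalogue()
catalogue.register(CatalogueEntry(
    name="customer_orders",
    data_format="parquet",
    domains=["sales", "finance"],
    location="warehouse.orders",
    geography="EMEA",
    allowed_roles={"analyst", "steward"},
))
catalogue.register(CatalogueEntry(
    name="payroll",
    data_format="csv",
    domains=["finance"],
    location="hr_share/payroll.csv",
    allowed_roles={"steward"},
))

finance_for_analyst = catalogue.search("finance", "analyst")
```

In this sketch an analyst searching the finance domain would see `customer_orders` but not `payroll`, because access control is part of the catalogue metadata.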

Note that I prefer this term to "information catalogue", which has other connotations. Note also that some sources erroneously refer to database catalogues as data catalogues.

The problem is that these catalogues are proliferating. There are pure-play vendors such as Alation and Waterline; there are data lake management suppliers such as Podium Data, Zaloni and Unifi; there is Cambridge Semantics with its Anzo smart data lake (more of which in a forthcoming article); there are all the purveyors of data governance solutions and, last but not least, pretty much every company in the BI and analytics space.

In theory, one would like a single data catalogue that spanned the entire enterprise and was used by all relevant products and tools. Unfortunately, history tells us that this is not going to happen, and organisations will inevitably end up with multiple data catalogues. Rather than wait for these silos to become established, it would be sensible to put a strategy in place to overcome this problem before it becomes a major issue.

So, how would you approach this? Do you want a catalogue of catalogues? Do you want master data catalogue management? I don’t think so. I think a better approach will be to use a knowledge graph.

Knowledge graphs are graphical representations of the relationships between things. For example, Thomson Reuters is providing a financial knowledge graph for its clients that allows you to explore which companies own which subsidiaries, which organisations share which non-executive directors and so on and so forth. There are lots of other potential use cases: understanding supply chains, for instance. In general, knowledge graphs apply wherever you have complex non-hierarchical relationships.
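As a rough illustration of the kind of non-hierarchical question a knowledge graph answers, the sketch below stores relationships as simple subject, predicate, object triples. The company and director names are invented, and this is not how any particular product (Thomson Reuters' included) is implemented.

```python
from collections import defaultdict

# Each triple reads as: subject --predicate--> object (all names hypothetical).
triples = [
    ("HoldCo", "owns", "SubA"),
    ("HoldCo", "owns", "SubB"),
    ("Alice", "non_exec_director_of", "HoldCo"),
    ("Alice", "non_exec_director_of", "OtherCo"),
    ("Bob", "non_exec_director_of", "OtherCo"),
]


def subsidiaries(parent: str) -> list[str]:
    """Which companies does this parent own directly?"""
    return sorted(o for s, p, o in triples if s == parent and p == "owns")


def shared_directors(company_a: str, company_b: str) -> list[str]:
    """Which non-executive directors sit on both boards?"""
    boards = defaultdict(set)
    for s, p, o in triples:
        if p == "non_exec_director_of":
            boards[o].add(s)
    return sorted(boards[company_a] & boards[company_b])


holdco_subs = subsidiaries("HoldCo")
common = shared_directors("HoldCo", "OtherCo")
```

Ownership is hierarchical but shared directorships cut across that hierarchy, which is why a graph of triples fits better than a tree.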

In the world of big data and IT, I think the use of knowledge graphs as a meta-layer that spans data catalogues has a lot of potential. Cambridge Semantics actually does this within its Anzo product, supporting the concept of sub-catalogues. For example, you might want a catalogue specifically for analytic models, where you want to know which models are deployed, in which applications, who the model owner is, who uses the model and so on.
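The meta-layer idea can be sketched in the same way: a small graph whose nodes point into several separate catalogues, including a sub-catalogue for analytic models. This is an assumption-laden illustration rather than a description of how Anzo works; the catalogue names, the `catalogue:asset` identifier convention and the predicates are all hypothetical.

```python
# Nodes are "namespace:identifier" strings; the namespace on the left of an
# edge names the catalogue that holds the entry (hypothetical convention).
edges = [
    ("bi_catalogue:sales_report", "describes", "dataset:sales"),
    ("lake_catalogue:sales_raw", "describes", "dataset:sales"),
    ("model_catalogue:churn_v2", "trained_on", "dataset:sales"),
    ("model_catalogue:churn_v2", "deployed_in", "app:crm"),
    ("model_catalogue:churn_v2", "owned_by", "person:jo"),
]


def catalogues_describing(dataset: str) -> list[str]:
    """Which catalogues hold an entry for this dataset?"""
    return sorted({
        s.split(":")[0]
        for s, p, o in edges
        if p == "describes" and o == dataset
    })


def model_facts(model: str) -> dict[str, str]:
    """Everything the graph records about one analytic model."""
    return {p: o for s, p, o in edges if s == model}


sales_catalogues = catalogues_describing("dataset:sales")
churn_facts = model_facts("model_catalogue:churn_v2")
```

The point of the design is that none of the underlying catalogues has to change: the graph only records which entries in which catalogues refer to the same thing, and how models, applications and owners relate to them.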

I think this idea has wider applicability, and knowledge graphs generally can be deployed to resolve the impending catalogue proliferation issue. In this context it is worth commenting that a number of the catalogue vendors offer knowledge graph capability: for example, both IBM and Informatica do so. In theory, these might be independent of the data catalogue(s) in use but in practice they are unlikely to be perceived as such, so there is scope for truly agnostic solutions. Franz, with its AllegroGraph database, is focusing particularly on supporting knowledge graphs (in general, not just for the cataloguing use case) and, if I am right in thinking that this is the best way forward, then no doubt other graph database providers will start focusing on this issue.