De-duplication in a graph database

An interesting discussion arose during the TDWI conference in London this week. The question was posed: could you use a graph database to do matching and de-duplication?

The answer must be yes. If Bill Clinton and William Clinton (the example posed during the session) have the same relationships, they must surely be the same person, though given the nature of some of the ex-president’s relationships it would perhaps be better to talk about following the edges of the graph rather than the relationships they represent. In fact, if you are using a graph database to look at terrorist or criminal networks, this is precisely one of the things you would be doing, because you want to understand which aliases equate to which real individuals.

First of all, I should say that I am not aware of any graph vendor packaging up special facilities to support matching and de-duplication, though I imagine there are things they could do to make the process easier. Still, the concept is quite cool: it would mean that you don’t need to license such capabilities from the likes of Trillium or Informatica. Of course, there are other data cleansing requirements beyond matching, but matching does tend to be the bedrock of all such environments. So could a graph database be a real competitor?

Of course, the big advantage is that, if you already have a graph database, there is no additional license fee.

What I don’t really know is how performance would compare. Vendors in the data quality field are apt to claim that their matching engine can outperform anybody else’s, something that is inherently impossible to prove one way or the other because you can’t compare match accuracy across platforms.

Nevertheless, my guess is that a graph database could seriously outperform a conventional matching engine. That’s because graph databases have been explicitly designed to explore relationships, and that’s precisely what you do when matching: you have two similar but non-identical names, and each has relationships with an address, a mother’s name, a phone number, an email address and so on. Instead of searching through table after table, you just follow the edges of the graph.
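
To make the idea concrete, here is a minimal sketch in plain Python rather than in any particular graph product. Everything in it is my own assumption for illustration: the sample records and their attribute values are invented, the function names build_graph and candidate_matches are hypothetical, and scoring by the overlap of shared neighbours (a Jaccard score) with a 0.5 threshold is just one simple choice, not how any vendor does it.

```python
from collections import defaultdict
from itertools import combinations

# Invented sample data: each record is linked to a few attribute values.
records = {
    "Bill Clinton": {"address": "1 Example St", "phone": "555-0100",
                     "email": "bc@example.com"},
    "William Clinton": {"address": "1 Example St", "phone": "555-0100",
                        "email": "wjc@example.com"},
    "Bill Clifton": {"address": "9 Other Rd", "phone": "555-0199",
                     "email": "clifton@example.com"},
}


def build_graph(records):
    """Link each record node to its attribute-value nodes (undirected edges)."""
    graph = defaultdict(set)
    for name, attrs in records.items():
        record_node = ("record", name)
        for kind, value in attrs.items():
            attr_node = (kind, value)
            graph[record_node].add(attr_node)
            graph[attr_node].add(record_node)
    return graph


def candidate_matches(graph, threshold=0.5):
    """Return (name_a, name_b, score) for records sharing enough neighbours."""
    # Two-hop traversal: any two records attached to the same attribute
    # node are candidates. No scan over every possible pair is needed.
    pairs = set()
    for node, neighbours in graph.items():
        if node[0] != "record":
            names = sorted(n[1] for n in neighbours)
            pairs.update(combinations(names, 2))

    results = []
    for a, b in sorted(pairs):
        na, nb = graph[("record", a)], graph[("record", b)]
        score = len(na & nb) / len(na | nb)  # Jaccard overlap of neighbours
        if score >= threshold:
            results.append((a, b, score))
    return results


matches = candidate_matches(build_graph(records))
print(matches)
```

With this made-up data, Bill Clinton and William Clinton share an address and a phone number, so they come out as a candidate pair, while Bill Clifton shares no neighbours with anyone and is never even compared. A real implementation would also need fuzzy comparison of the attribute values themselves, which this sketch ignores.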

Will we see this become a practical reality? To be honest, I don’t know. The person who raised the issue in the first place said that his company couldn’t use conventional data quality tools for matching (he didn’t explain why), so maybe there is a problem out there that can be solved by using graphs. Certainly, if you are going to be using a graph database anyway, it may make sense to look at it for this purpose as well.