The problem with matching

Imagine you are a business person with no particular knowledge of the details of IT who is watching a demonstration of data quality matching. The vendor shows you a series of records of people whose details are all more or less the same, but not quite. The vendor then shows you how the software can calculate that there is a 93% chance that person A is the same as person B, a 95% chance that he is the same as person C, and an 87% chance that he is the same as person D.

Very impressive. Except that, to use the vernacular, it is usually “bleeding obvious” that these are all the same person and that person E is someone else entirely. Any relatively sane person is going to react by asking why he should spend a whole bunch of money on something that appears so facile.

Of course, data cleansing software does lots of things other than simple matching and de-duplication. But if you are a marketing manager who simply wants to clean up his mailing lists then you probably don’t care much about all that extra functionality. Moreover, you probably don’t care very much about perfection. If you could simply and easily clear out, say, 80% of your duplicate records, then that would be a result.

So, what this sort of person would like is an automated matching and de-duplication solution. He would set a threshold for the matching percentage and, if a match exceeded that threshold, the software would automatically remove or merge the relevant duplicates. Whether the software should access the database directly is perhaps another matter, and an indirect method might be preferable, but the process would still need to be automated. Of course, automated removal of duplicates wouldn’t be appropriate for applications where the data is more critical, but in environments such as marketing this shouldn’t be a problem.
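To make the idea concrete, here is a minimal sketch of threshold-based de-duplication in Python. It is an illustration under stated assumptions, not how any particular data quality product works: the function names, the sample mailing list and the 85% default threshold are all hypothetical, and the score from the standard library's `difflib.SequenceMatcher` is a crude string-similarity measure standing in for a vendor's match percentage, not a true probability.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a rough 0-100 similarity score for two records (not a probability)."""
    return 100.0 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def deduplicate(records: list[str], threshold: float = 85.0) -> list[str]:
    """Keep a record only if it scores below the threshold against every record already kept."""
    kept: list[str] = []
    for record in records:
        if not any(similarity(record, existing) >= threshold for existing in kept):
            kept.append(record)
    return kept


# Hypothetical mailing list: the first three entries are the same person.
mailing_list = [
    "John Smith, 12 High Street, Leeds",
    "Jon Smith, 12 High Street, Leeds",
    "John Smith, 12 High St, Leeds",
    "Mary Jones, 4 Park Road, York",
]

unique = deduplicate(mailing_list)
for record in unique:
    print(record)
```

This version simply keeps the first record of each group and drops the rest; merging the duplicates instead would need rules for choosing which field values survive. It also compares every record with every record kept so far, which would be slow on a large list.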

Similar considerations apply when comparing two different files. For example, you might want to compare incoming records against a master file (that is, to check every incoming record against every master record to find out whether it already exists in the master); or you might want to measure the consistency of vendor, customer, partner or product records across two systems; or you might want to find the overlap between the vendors, customers, partners or products held in two systems.
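The first of these cases, checking incoming records against a master file, might be sketched as follows. Again this is only an illustration: `match_against_master`, the sample company names and the 85% threshold are hypothetical, and `difflib.SequenceMatcher` is used as a simple stand-in for whatever matching logic a real product would apply.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a rough 0-100 similarity score for two records (not a probability)."""
    return 100.0 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_against_master(incoming, master, threshold=85.0):
    """Split incoming records into those already in the master and those that are new."""
    already_present = []  # (incoming record, best master match, score)
    new_records = []
    for record in incoming:
        # Find the master record that looks most like this incoming record.
        best = max(master, key=lambda m: similarity(record, m), default=None)
        score = similarity(record, best) if best is not None else 0.0
        if score >= threshold:
            already_present.append((record, best, score))
        else:
            new_records.append(record)
    return already_present, new_records


# Hypothetical master file and incoming batch.
master = ["Acme Supplies Ltd, Manchester", "Northern Tools plc, Sheffield"]
incoming = ["ACME Supplies Limited, Manchester", "Brightside Catering, Hull"]

present, new = match_against_master(incoming, master)
for record, match, score in present:
    print(f"{record!r} matches {match!r} ({score:.0f}%)")
print("New records:", new)
```

The list of records already present is, in effect, the overlap between the two files, so the same function covers the third case as well. Measuring consistency between two systems would additionally need a field-by-field comparison of each matched pair.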

Key requirements, needless to say, are that the solution be inexpensive and easy to use. The former probably means that the direct sales model traditionally adopted by data quality vendors will be inappropriate, so the solution will need to be available either as a download or as a cloud-based offering.

The big question, of course, is why (as far as I am aware) no products of this type are available. I can’t believe that it would be that technically difficult to develop such an offering. So perhaps I am wrong in thinking that there would be demand for such a solution. I would be interested in feedback on this point: if you are on the Bloor Research website, please complete the Quick Poll in the right-hand column; otherwise please go to this link and vote.