The limits of machine learning

Consider credit card fraud detection. Typically, an algorithm embedded in the detection software looks at the relevant pattern of events and suggests whether a particular transaction is likely to be fraudulent. These algorithms turn up a lot of false positives. Historically, a data scientist (data miner) would periodically re-analyse the data and attempt to create a better algorithm with fewer false positives. What machine learning is intended to do is replace this manual process with an automated one, whereby you feed results back to the software and it iteratively improves its own algorithm(s).
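To make the feedback idea concrete, here is a minimal sketch in Python. It is a toy, not how any real fraud product works: the `ThresholdDetector` class, its `feed_back` method, the risk scores and the step size are all hypothetical names and values invented for illustration. The only point is that each confirmed outcome nudges the detector, so nobody has to re-tune it by hand.

```python
from dataclasses import dataclass


@dataclass
class ThresholdDetector:
    """Toy detector: flags a transaction when its risk score exceeds a threshold."""

    threshold: float = 0.5
    step: float = 0.05  # assumed size of the adjustment made per piece of feedback

    def is_suspicious(self, risk_score: float) -> bool:
        return risk_score > self.threshold

    def feed_back(self, risk_score: float, was_fraud: bool) -> None:
        """Adjust the threshold using the confirmed outcome of one transaction."""
        flagged = self.is_suspicious(risk_score)
        if flagged and not was_fraud:
            # False positive: become less eager to flag.
            self.threshold = min(1.0, self.threshold + self.step)
        elif not flagged and was_fraud:
            # False negative: become more eager to flag.
            self.threshold = max(0.0, self.threshold - self.step)


detector = ThresholdDetector()
# (risk score, confirmed fraud?) pairs, invented purely for illustration.
confirmed = [(0.55, False), (0.60, False), (0.90, True), (0.58, False)]
for score, was_fraud in confirmed:
    detector.feed_back(score, was_fraud)
```

After the two confirmed false positives the threshold has drifted upwards, so the fourth transaction (score 0.58) is no longer flagged, while the genuinely fraudulent one (score 0.90) still is. Real systems learn far richer models than a single threshold, but the loop of predict, receive feedback and adjust has the same shape.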

This technique can also be applied to semi-automated processes such as data stewardship (data matching and cleansing) or discovering sensitive data that you need to mask. Other examples include data preparation tools, where the software might recommend appropriate join keys for data blending, and, more generally, recommendation engines suggesting the next best offer. No doubt there are plenty of other use cases.

All this is well and good. What nobody seems to be asking, amongst all the hype, is just how much difference machine learning makes or will make. I think it is fairly clear that it should improve performance but I, for one, have not seen any figures to suggest by how much. If anyone has seen like-for-like comparisons I’d love to see them but, failing that, my guess would be that machine learning is, at best, going to make a five to ten percent improvement in performance. In some cases, it might only be one or two percent. Bear in mind, too, that there will be an initial slow ramp-up (as the software accumulates data), which then accelerates, followed by a drop-off and, eventually, a long tail. So, improvements won’t come all at once.

How good is five to ten percent? Actually, quite good. Consider processing tens of millions of credit card transactions daily. Even a one percent reduction in false positives is going to avert a lot of disgruntled customers. So, a five to ten percent improvement in supporting a data steward could be quite significant. It might actually compensate for the additional data that the steward is having to look at every day.
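A back-of-the-envelope calculation shows why small percentages matter at this scale. The figures below are assumptions chosen only to illustrate the arithmetic: 20 million transactions a day, one in a hundred of them wrongly flagged, and the "one percent" read as a relative reduction in the number of false positives. None of these numbers comes from a real system.

```python
def avoided_false_positives(
    daily_transactions: int,
    false_positives_per_10k: int,
    relative_reduction_pct: int,
) -> int:
    """Daily false positives avoided for a given relative reduction (integer maths)."""
    baseline = daily_transactions * false_positives_per_10k // 10_000
    return baseline * relative_reduction_pct // 100


# Assumed inputs: 20 million transactions, 1% wrongly flagged, 1% relative reduction.
avoided_per_day = avoided_false_positives(20_000_000, 100, 1)
print(avoided_per_day)  # 2000
```

On these assumed numbers, a one percent reduction spares 2,000 customers a day from a wrongly declined or queried transaction, and a five to ten percent reduction spares 10,000 to 20,000.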

What machine learning is not is a panacea. It won’t make the problem go away. There will still be false positives and false negatives. These may reduce in percentage terms but they could still increase in absolute terms, simply because the problems (data overload) are getting bigger.
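The gap between a falling percentage and a rising count is easy to see with a small worked example. The volumes and rates here are hypothetical: I am assuming that transaction volume grows from 10 million to 15 million a day while the false positive rate improves from 1.00 percent to 0.90 percent.

```python
def false_positives(transactions: int, rate_per_10k: int) -> int:
    """Number of false positives for a volume and a rate given per 10,000 transactions."""
    return transactions * rate_per_10k // 10_000


# Hypothetical figures: the rate falls by a tenth while the volume grows by half.
before = false_positives(10_000_000, 100)  # 1.00% of 10 million
after = false_positives(15_000_000, 90)    # 0.90% of 15 million
print(before, after)  # 100000 135000
```

In this scenario the rate is ten percent better, yet there are 35,000 more false positives a day to deal with, because the growth in data outpaces the improvement.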

As I said, I have not seen any figures that demonstrate how effective machine learning is. Apart from anything else, I would expect this to be product- and domain-specific. However, if anyone has any figures I’d love to see them. In the meantime, I am a fan of machine learning but I recognise that it has limitations.