The drunkard’s walk

The drunkard’s walk is a concept from probability theory, and it has applications in various aspects of IT. I had better start by explaining what it is.

Actually, I’ll start by talking about tennis. During Wimbledon there have been the usual comments about bad luck with net cords evening out. Unfortunately, that isn’t true. At the start of a match both players have a 50:50 chance of suffering more bad net cords than their opponent. Now, suppose that you get the first bad net cord: what are the chances for the remainder of the match? Still 50:50, of course. The remaining net cords are expected to split evenly, so nothing compensates for that first one: if you were the person who got the first unfortunate bounce off the net, then on average you will come off worse over the match as a whole. The same is true of bad lbw decisions in cricket.
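
To make that concrete, here is a minimal sketch in Python. The numbers are invented for illustration (ten bad net cords per match, each an independent 50:50 event), and we condition on you suffering the first of them:

```python
import random

# Minimal sketch, with invented numbers: 10 bad net cords per match,
# each an independent 50:50 event, and you suffer the first one.
TRIALS = 100_000
CORDS = 10

worse = better = deficit = 0
for _ in range(TRIALS):
    rest = sum(random.random() < 0.5 for _ in range(CORDS - 1))
    yours, theirs = 1 + rest, CORDS - 1 - rest   # the first one was yours
    deficit += yours - theirs
    worse += yours > theirs
    better += yours < theirs

print(f"mean net-cord deficit: {deficit / TRIALS:.2f}")            # about 1.0
print(f"P(worse): {worse / TRIALS:.2f}  P(better): {better / TRIALS:.2f}")
```

With these numbers you finish, on average, one net cord down, and you are about twice as likely to come off worse than to come off better: nothing in the remaining play compensates for that first bounce.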

The drunkard’s walk is an extension of this idea: you watch a drunkard lurching in any one of eight directions, on the assumption that each lurch is independent of the one before. What you find is that, as the walk goes on, he or she tends to get further and further away from the starting point (the expected distance grows roughly with the square root of the number of lurches). In other words, going back to tennis or cricket, or even capital markets (‘the trend is your friend’), there is a tendency for luck, good or bad, to perpetuate itself. Which is, incidentally, a good reason to support the use of technology in sports, since it removes the bad decision issue.

The drunkard’s walk may seem to contradict the idea of reversion to the mean (which is a fancy way of saying that things even out) but in fact it doesn’t. If you had a million drunks all starting from the same point and monitored where they were after a few hundred steps, the mean position of all the drunks would be at or close to the starting point, but each individual drunk would probably be a significant distance away from it: reversion to the mean applies to the average of all the drunks, not to any individual drunk.
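
Here is a sketch of that argument in Python, scaled down to 10,000 walkers and 400 steps (my numbers, not a million drunks), using the eight lurch directions described above:

```python
import math
import random

# Sketch of the million-drunks argument, scaled down for speed:
# 10,000 walkers, 400 independent lurches each, in one of 8 directions.
DIRECTIONS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
              if (dx, dy) != (0, 0)]
WALKERS, STEPS = 10_000, 400

sum_x = sum_y = sum_dist = 0.0
for _ in range(WALKERS):
    x = y = 0
    for _ in range(STEPS):
        dx, dy = random.choice(DIRECTIONS)
        x, y = x + dx, y + dy
    sum_x += x
    sum_y += y
    sum_dist += math.hypot(x, y)

# The crowd's mean position stays at the start...
print(f"mean position: ({sum_x / WALKERS:.1f}, {sum_y / WALKERS:.1f})")
# ...but the typical individual is roughly sqrt(STEPS) lurches away.
print(f"mean distance from start: {sum_dist / WALKERS:.1f}")
```

The first figure comes out at or near (0, 0); the second comes out around twenty lurches: reversion to the mean for the crowd, steady drift for the individual.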

So, things don’t even out for individual tennis players or cricketers. This will be familiar to any database administrator who is seeing skew in his partitions: if luck genuinely evened out, skew would correct itself and wouldn’t be a problem; in fact, once it starts happening you know it will only get worse.
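
A hypothetical two-partition example shows why. Rows are assigned by a fair 50:50 hash, so the proportion on each partition looks healthy, but the absolute imbalance between the partitions keeps growing, roughly with the square root of the row count:

```python
import random

# Hypothetical example: rows hashed 50:50 across two partitions.
# The percentage imbalance shrinks, but the row-count imbalance grows.
imbalance = 0
for n in range(1, 1_000_001):
    imbalance += 1 if random.random() < 0.5 else -1
    if n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{n:>9} rows: imbalance = {abs(imbalance):>5} rows "
              f"({abs(imbalance) / n:.3%} of the table)")
```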

The actual case in which the drunkard’s walk recently came up was when I was asked about the application of data quality to data mining. Data mining tools typically have facilities for coping with things like null fields and the use of default values, but they are otherwise fairly limited in this regard. Moreover, with the exception of SAS, which owns DataFlux (and even then I don’t think the company thinks of DataFlux especially in conjunction with Enterprise Miner), the data mining vendors don’t have partnerships with data quality vendors and don’t talk about the subject much. The same is true of people using Matlab or R. The reason, I think, is that they expect any errors in the data to ‘even out’ given that they are working with very large sets of data. But errors won’t even out, unless you also run the models a very large number of times against a very large number of datasets. So, if you want accurate data mining models you need to pay close attention to data quality, just as you do for other business intelligence applications.
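
One concrete failure mode, sketched below with invented figures: if null fields are imputed with a default value (exactly the kind of facility mentioned above), the error is systematic rather than random, and a bigger dataset simply estimates the wrong answer with more confidence:

```python
import random
import statistics

# Invented figures: true readings average 50, but 10% of fields arrive
# null and are imputed with a default of 0. The bias is systematic, so
# it does not 'even out' as the dataset grows.
for n in (1_000, 100_000, 1_000_000):
    sample = [0.0 if random.random() < 0.1 else random.gauss(50, 10)
              for _ in range(n)]
    print(f"n = {n:>9}: estimated mean = {statistics.fmean(sample):6.2f}"
          f"  (true mean 50.00)")
```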

Other applications of the drunkard’s walk in IT? Well, anywhere something has started to diverge from the mean: without remedial action it is likely to diverge further. That would include things like deteriorating performance, risk management and errors in spreadsheets. No doubt you can think of others.