How much data do you need?

There was an article in Information Week last week whose summary was “companies have more data than they think, they need less data than they think, and predictive models consistently outperform human decision-making abilities.”

The problem with this statement (and it is typical of what a wide range of commentators are saying) is that it is a generalisation and is by no means always true. I wouldn’t argue with the first part, but the other two parts are open to dispute. Take the final part: I don’t disagree per se, but the implication is that predictive models are always better, and that holds only where the complexity is not too great. I will give an example in a minute.

As for needing less data than businesses think, this is too sweeping. Take the UK’s ONS (Office for National Statistics) as an example. According to the Sunday Times it bases its estimates “on monthly business surveys and officials have just 44% of the data required for the final estimate of output”. Now, you or I might think that 44% could provide a pretty representative sample but, between the beginning of 1990 and the end of 2010, ONS estimates were revised upwards 83% of the time by an average of 0.8%.

For example, in July 2003 year-on-year growth in the UK economy was estimated at 2.1% and, as a result, the Bank of England (BoE) decided to cut interest rates, which led to further inflation in the already over-heated housing market and meant that the crash in 2007 was worse than it might otherwise have been. The ONS later revised its growth estimate for that period to 3.6%. Had the BoE had accurate figures in July 2003 it might never have reduced interest rates, and we might have been in a very different position in 2007 and thereafter.

The point I want to make here is, firstly, that sometimes you really do need all of the data (or as close to that as you can get) before you attempt to make decisions and, secondly, that there are some sorts of decisions that are not amenable to predictive models and for which, indeed, there are no suitable predictive models. In particular, predictive models are only of value where there is a limited number of variables, all of which are well understood. This is certainly not the case when it comes to economics. To take a separate example, it should be relatively easy to build a predictive model that forecasts whether any particular marriage will be successful (however you measure that), because there are lots of statistics about divorce rates and it should be easy enough to find correlations by age, social class, education, employment and so forth that could predict whether or not a particular marriage would end in divorce (say). But would you use any such model to determine your future? The model might be statistically accurate enough to be useful for understanding crowds, but the same cannot be said when it comes to individuals.
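To make the crowd-versus-individual distinction concrete, here is a minimal sketch in Python. It is purely illustrative: the 40% divorce rate is a made-up figure, the function name simulate_marriages is my own, and the "model" is nothing more than a base rate applied to simulated couples. It compares how well that base rate predicts the overall share of divorces with how often the best single guess is wrong for an individual couple.

```python
import random


def simulate_marriages(n_couples=100_000, divorce_rate=0.4, seed=0):
    """Compare crowd-level and individual-level accuracy of a base-rate model.

    divorce_rate is a made-up figure for illustration, not a real statistic.
    """
    rng = random.Random(seed)
    # True means the (simulated) marriage ends in divorce.
    outcomes = [rng.random() < divorce_rate for _ in range(n_couples)]

    # Crowd level: the model predicts the overall share of divorces.
    actual_share = sum(outcomes) / n_couples
    crowd_error = abs(divorce_rate - actual_share)

    # Individual level: the best single guess for every couple is the
    # more likely outcome, and it is wrong whenever the other one happens.
    guess_divorce = divorce_rate > 0.5
    wrong = sum(1 for outcome in outcomes if outcome != guess_divorce)
    individual_error = wrong / n_couples

    return crowd_error, individual_error


if __name__ == "__main__":
    crowd_error, individual_error = simulate_marriages()
    print(f"error in predicted divorce share: {crowd_error:.4f}")
    print(f"share of couples guessed wrongly: {individual_error:.4f}")
```

Under these assumptions the error in the predicted share should come out close to zero, while roughly four in ten of the individual guesses are wrong. A real model with more variables would narrow that gap, but it would not close it.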

Thus, to return to the statement in the first paragraph of this article: append “sometimes” or “generally speaking” to it.