Critical, relevant and irrelevant data

Archival is a way of keeping data that you need to retain (for a variety of reasons) but cannot afford to store online. Traditionally, deciding what data to archive off your data warehousing system has been the function of archival policies that are predicated on one of two things: either age or frequency of access. That is, archival policies are typically something like “if it is more than six months old then put it onto near-line storage and if it is more than a year old then put it on tape” or “if this information has not been accessed by anybody within period x then put it onto near-line storage and if it has not been accessed within period x + y then put it onto tape” or some combination of these two.

The advantage of such an approach is that it is simple: it is easy to formulate these rules and it is easy for software to monitor the data so that these rules can be enforced. The disadvantage is surely that such rules are simplistic.
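To illustrate just how simple such rules are, here is a minimal Python sketch of the two kinds of policy described above. The six-month and one-year thresholds come from the example rule; everything else (the names Tier, tier_by_age, tier_by_access and tier_combined, and the choice to combine the two rules by taking the colder tier) is a hypothetical illustration, not any particular product's behaviour.

```python
from datetime import datetime, timedelta
from enum import Enum


class Tier(Enum):
    """Storage tiers, declared from hottest to coldest."""
    ONLINE = "online"
    NEARLINE = "near-line"
    TAPE = "tape"


def tier_by_age(created, now,
                nearline_after=timedelta(days=182),
                tape_after=timedelta(days=365)):
    """Age-based rule: older than ~six months goes near-line, older than a year goes to tape."""
    age = now - created
    if age > tape_after:
        return Tier.TAPE
    if age > nearline_after:
        return Tier.NEARLINE
    return Tier.ONLINE


def tier_by_access(last_accessed, now, x, y):
    """Access-based rule: idle longer than x goes near-line, idle longer than x + y goes to tape."""
    idle = now - last_accessed
    if idle > x + y:
        return Tier.TAPE
    if idle > x:
        return Tier.NEARLINE
    return Tier.ONLINE


def tier_combined(created, last_accessed, now, x, y):
    """One possible combination (an assumption): apply both rules and keep the colder result."""
    order = list(Tier)  # definition order: ONLINE, NEARLINE, TAPE
    by_age = tier_by_age(created, now)
    by_access = tier_by_access(last_accessed, now, x, y)
    return max(by_age, by_access, key=order.index)
```

Note that nothing in these functions looks at what the data is or who needs it: the only inputs are timestamps.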

The data that you need in your data warehouse is the data that you need. Which sounds silly but is a truism: you need what you need to do your job and this can’t be determined by either time or frequency of use but only by relevance to the task in hand. Hence the title of this article: the archival policies that you would really like to have would keep critical data in the warehouse, relevant data in near line storage and data that you don’t care about on tape.

Of course, the problem with this is determining what is critical, relevant or irrelevant. Let me be more specific: critical data is information that you know you need now, relevant data is information that you will probably require next week or next month but not right now, and irrelevant data is information that you have to keep (usually for compliance purposes) but hope never to have to access again.

So, who makes such decisions? Well, if you have a data governance council then this is the body that should make such decisions, or it might delegate them to data stewards. However, regardless of whether you do this formally via data governance or leave it to IT, the decision will end up being subjective. Moreover, you can’t simply go around to everybody and ask them what information they are going to need next week. To begin with, the task would be logistically impractical and, in any case, people simply cannot predict such things accurately. In other words, the distinction between critical and relevant information is spurious.

So, what about irrelevant data? The problem is, as I wrote in a previous article, that you need historic information in order to do any sort of trend analysis. It is difficult to think of any department within a company that might not want to do trend analysis on an ongoing basis: you want to track customer satisfaction, marketing campaigns, supplier return rates and financial trends; even the HR department might want to analyse the performance of recruitment agencies over time or to look at trends in staff turnover. So there is no such thing as irrelevant data either.

The conclusion is that archival is a bad idea. Of course, it potentially saves money, but it is a stopgap that exists because you can’t do what you would really like to do, which is to retain all data online. What is the solution? Well, you could implement a separate data mart for each of these departments, but this leaves a problem when you want to do any sort of analysis that crosses these data marts, and it also poses a control problem if data marts are proliferating. Alternatively, you could simply have one single, very large data warehouse. But this is only feasible if the performance is good enough and if it is relatively inexpensive in terms of hardware, software and installation costs. To date, the only company that I know of that is actively addressing this possibility is Dataupia. I have written about the company before, so I won’t belabour its virtues, but it is an approach worth considering.