Data archival is the process of moving data that is no longer actively used to a separate storage device for long-term retention. In other words, the archive consists of older data that is still important and necessary for future reference, or that must be retained for regulatory compliance purposes.
In practice, data archival is a subset of information lifecycle management (ILM). What ILM adds is the assurance that data is archived on an ongoing basis as it reaches its 'sell-by date', which may be defined by age (the data is so many months old) or, preferably, by the data no longer being in active use. ILM policies also determine when the data is eventually deleted.
ILM policies determine the point at which data should be archived. This may be because the data is 'x' months old, because no one has accessed it for a certain period, or for some regulatory reason. Whenever that point is reached, the data is moved from the main online environment to the archive (which may itself be online, but on less expensive hardware and perhaps more deeply compressed).
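The three triggers described above can be sketched as a small policy check. This is only an illustration under stated assumptions: the `Record` and `ArchivePolicy` classes, their field names and the order in which the triggers are tested are all hypothetical and are not taken from any particular ILM product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Record:
    record_id: int
    created_at: datetime
    last_accessed_at: datetime
    # Hypothetical field: date on which a regulation requires the move, if any.
    regulatory_archive_date: Optional[datetime] = None


@dataclass
class ArchivePolicy:
    max_age: timedelta   # archive once a record is at least this old
    max_idle: timedelta  # archive once a record has gone unread this long

    def archive_reason(self, record: Record, now: datetime) -> Optional[str]:
        """Return why the record should be archived, or None to keep it online."""
        due = record.regulatory_archive_date
        if due is not None and now >= due:
            return "regulatory"
        if now - record.last_accessed_at >= self.max_idle:
            return "inactivity"
        if now - record.created_at >= self.max_age:
            return "age"
        return None
```

A scheduled job could call `archive_reason` for each candidate record and hand anything that returns a reason to the archival process. Inactivity is tested before age here only because usage is described as the preferable criterion; a real policy might weigh the triggers differently.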
Methods used for archival storage vary. A common method is for the data to be indexed and stored in a file format (for example, XML) with search capabilities, so that files and parts of files can be easily located and retrieved. The likely use cases for retrieval should be considered up front, because the cost, timeliness and efficiency of retrieval may be an issue. However, it is sometimes also a requirement to be able to access the archived data from the originating application. In this case it will be necessary to preserve referential integrity during the archival process, so that related records (a parent row and its dependent rows, say) are moved together. This will often mean that data profiling and discovery tools need to be used to establish the relevant associations within the data to be moved.
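To make the referential integrity point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes the parent/child relationship is already known (in practice this is what discovery tools would establish); the `orders` and `order_lines` tables, the attached `archive` database and both function names are hypothetical.

```python
import sqlite3


def setup_demo() -> sqlite3.Connection:
    """Create a toy online database with an attached in-memory archive."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS archive")
    for schema in ("main", "archive"):
        conn.execute(
            f"CREATE TABLE {schema}.orders (id INTEGER PRIMARY KEY, placed_on TEXT)"
        )
        # order_lines.order_id refers to orders.id within the same schema.
        conn.execute(
            f"CREATE TABLE {schema}.order_lines "
            "(id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT)"
        )
    conn.executemany(
        "INSERT INTO main.orders VALUES (?, ?)",
        [(1, "2019-03-01"), (2, "2024-05-01")],
    )
    conn.executemany(
        "INSERT INTO main.order_lines VALUES (?, ?, ?)",
        [(1, 1, "A"), (2, 1, "B"), (3, 2, "C")],
    )
    conn.commit()
    return conn


def archive_orders_before(conn: sqlite3.Connection, cutoff: str) -> int:
    """Move orders placed before `cutoff`, together with their lines."""
    with conn:  # one transaction: everything moves or nothing does
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM main.orders WHERE placed_on < ?", (cutoff,)
            )
        ]
        if not ids:
            return 0
        marks = ",".join("?" * len(ids))
        # Copy parents and children first, then delete children before parents.
        conn.execute(
            f"INSERT INTO archive.orders "
            f"SELECT * FROM main.orders WHERE id IN ({marks})", ids)
        conn.execute(
            f"INSERT INTO archive.order_lines "
            f"SELECT * FROM main.order_lines WHERE order_id IN ({marks})", ids)
        conn.execute(
            f"DELETE FROM main.order_lines WHERE order_id IN ({marks})", ids)
        conn.execute(f"DELETE FROM main.orders WHERE id IN ({marks})", ids)
    return len(ids)
```

Selecting the parent keys once and reusing them for every statement is what keeps the two stores consistent: no order line is left behind without its order, and none arrives in the archive without one.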
It is often the case that the access controls and security that apply to the archive are different from (and more lax than) those of the originating system. If so, data masking technology, as well as more general obfuscation techniques, may need to be applied to the archived data.
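As a rough illustration of what masking might look like before data lands in the archive, the sketch below shows two common techniques: keyed pseudonymisation (so the same input always maps to the same token and joins still work) and partial masking. The function names, the `email` and `card_number` field names and the choice of HMAC-SHA256 are assumptions made for the example, not a description of any vendor's masking product.

```python
import hashlib
import hmac


def pseudonymise(value: str, key: bytes) -> str:
    """Replace a value with a stable keyed token (HMAC-SHA256, truncated)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_keep_last(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Hide all but the last `visible` characters of a value."""
    if visible <= 0 or len(value) <= visible:
        # Too short to reveal anything safely: mask the whole value.
        return mask_char * len(value)
    hidden = len(value) - visible
    return mask_char * hidden + value[hidden:]


def mask_record(record: dict, key: bytes) -> dict:
    """Return a masked copy of a record; hypothetical sensitive fields only."""
    masked = dict(record)
    if "email" in masked:
        masked["email"] = pseudonymise(masked["email"], key)
    if "card_number" in masked:
        masked["card_number"] = mask_keep_last(masked["card_number"])
    return masked
```

The key passed to `pseudonymise` would need to be held outside the archive, since anyone with the key could test guesses against the stored tokens.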
Data archival has significant benefits for both business and IT managers:
- It reduces the cost of storage, because archived data tends to be more highly compressed than online data and because it sits on lower-cost storage in the first place.
- It frees up space on the originating system, thereby improving scalability. Depending on the agreements in place, this can also reduce database licensing costs.
- It improves performance for online systems.
Data archival is commonly used in conjunction with data migration projects. If you are migrating from one database to another, say, then it may make sense to archive some of the older data from the original system rather than migrating it. This reduces the size of the migration task, which is a benefit in itself.
Ultimately, the person who should most care about data archival is the CFO. While there will be an initial investment in the relevant software and expertise, data archival should significantly reduce costs over the medium and long term.
Data archival has been around for a long time and has been what might be called a slow burner. While the benefits of data archival are widely recognised, it has not, historically, been at the top of anyone's priority list. This is slowly starting to change as more complete suites of products come into the market.
In particular, one major change in this market is the adoption of Hadoop (or a NoSQL database) as a platform for archival. The low costs associated with such a platform are an obvious bonus. However, HDFS by default keeps three copies of every block, which offsets much of the storage saving, so you really need a purpose-built archival engine built on top of Hadoop (or HDFS) rather than using Hadoop itself.
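The effect of replication on storage footprint is simple arithmetic, sketched below. The replication factor of three is the HDFS default; the helper name `stored_terabytes`, the 100 TB volume and the 20:1 compression ratio are purely illustrative assumptions, not measured figures for any product.

```python
def stored_terabytes(raw_tb: float, compression_ratio: float, replicas: int = 1) -> float:
    """Physical storage needed for `raw_tb` of data.

    compression_ratio is expressed as N for an N:1 ratio (1 = uncompressed).
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return raw_tb * replicas / compression_ratio


# Illustrative figures only.
plain_hdfs = stored_terabytes(100, compression_ratio=1, replicas=3)       # 300.0
archive_on_hdfs = stored_terabytes(100, compression_ratio=20, replicas=3)  # 15.0
```

Under these assumed numbers, uncompressed data on HDFS occupies three times its raw size, whereas a heavily compressing archival layer on the same cluster still ends up well below it despite the replication.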
The most noticeable change in the market over the last few years has been the recognition that data archival is not a standalone technology. You need data profiling and discovery, data masking and ETL (extract, transform and load) to make it work. Both IBM and Informatica, in particular, have expanded their product sets over the last couple of years in order to provide all of these sorts of capabilities.
There are two approaches to archival: archive to a conventional database, with whatever degree of compression its vendor offers, or opt for a specialised archival product, which will likely offer a greater degree of compression. In the latter camp are companies such as Informatica and RainStor (which announced a deployment on Hadoop in 2013). In the former group, the most notable new announcement has been that SAP will support Sybase IQ as an archival platform.