Content Copyright © 2007 Bloor. All Rights Reserved.
The latest hot topic for storage users and vendors is de-duplication (de-dupe). The idea is simple but was not really feasible until low-cost SATA drives made ‘near-line’ storage and virtual tape libraries (VTL) much more cost-effective. On the other hand, de-dupe is still an immature technique, with vendors offering different approaches that produce variable results. So users need to consider their options before plunging headlong into what could be, for them, a sub-optimal solution.
Explaining the concept is the easy bit. The idea is to eliminate redundant (duplicate) files or, to go more granular, blocks or even bytes of duplicated data. The process can be a little like what happens with e-mail storage when an e-mail is sent to multiple people in a company: only one physical copy of each e-mail and its attachments is stored, but pointers (tags) are provided for all the recipients so that the one instance of the e-mail can be retrieved and viewed by all of them. A big difference is that e-mail content never changes, whereas files can keep changing.
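The single-instance idea described above can be sketched as a content-addressed store: identical content hashes to the same key and is stored once, while each logical name is just a pointer. The sketch below is purely illustrative (the class and its methods are hypothetical, not any vendor's API):

```python
import hashlib

class SingleInstanceStore:
    """Toy content-addressed store: one physical copy per unique content."""

    def __init__(self):
        self.blobs = {}     # content hash -> the bytes, stored once
        self.pointers = {}  # logical name -> content hash (the 'tag')

    def put(self, name, data):
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:   # first copy: store the bytes
            self.blobs[key] = data
        self.pointers[name] = key   # every copy: just a pointer

    def get(self, name):
        return self.blobs[self.pointers[name]]

store = SingleInstanceStore()
attachment = b"quarterly report" * 1000
for recipient in ("alice", "bob", "carol"):
    store.put(f"inbox/{recipient}/report", attachment)

# Three logical copies, one physical copy:
print(len(store.pointers), len(store.blobs))  # 3 1
```

The e-mail analogy maps directly: three recipients each see ‘their’ copy via a pointer, but only one physical instance exists. Files that keep changing would hash to new keys and so gain their own physical copies, which is the big difference the text notes.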
Most companies have multiple copies of every file and database on their systems, thanks partly to the need for back-up copies, including information held specifically for compliance purposes and off-site copies needed for disaster recovery protection. It has been estimated that businesses back up every file an average of five times, rising to eight in the financial sector. Reduce the number of file copies even by one while ensuring resilience for retrieval and recovery and you can save an awful lot of storage space. Reduce the amount of data to back up and you reduce the size of all the backups you take.
The simplest way this is done is to substitute a pointer for a file whenever a duplicate of that file exists on the system being backed up. De-duping usually takes place transparently as a by-product of another back-up function, so it does not introduce any management overhead.
Some storage managers have reported huge space savings of perhaps 10–20 times (i.e. saving most of the back-up space) by this method. Beware of such figures: first, they depend partly on the de-dupe approach employed (see below) and, second, on how inefficiently the company stored data previously. That is not to knock the technology. There are, of course, some pitfalls, but there are huge cost and space benefits to be had.
Explained simply, the most straightforward approach is to de-dupe automatically at file level during a full system back-up (the ‘in-line’ approach); if a file is detected as a repeat of one already in the storage pool then, instead of copying the whole file again, a pointer to the first copy is inserted in its place, and this is repeated for every duplicate copy of every file. This means the back-up holds only a single instance of each file. Recovery then has to follow the pointers to retrieve the files that were not physically copied. Take this process down to block or byte level and you end up with far more pointers, but also an even smaller back-up.
However, to achieve this, something like a reference table needs to be built up during (or at the start of) the back-up to record the files (or blocks) already copied, so that, after the full file or block has been copied once, an address pointer is inserted instead whenever a further instance is detected. As the de-dupe back-up progresses, this table gets ever longer and the algorithm-based calculations can become more complex, slowing the back-up. If the table is maintained in memory, which avoids the extra disk I/O otherwise incurred when updating it, there is instead a danger of exceeding memory capacity. Either way, there is a performance hit, with an increased potential for back-up failure.
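The reference table is essentially an index from a block's fingerprint to the address of its first stored copy. A minimal sketch of in-line block-level de-dupe with the table held in memory follows; the fixed-size chunking and the names used here are illustrative assumptions, as real products use far more sophisticated chunking and indexing:

```python
import hashlib

BLOCK_SIZE = 4096

def dedupe_backup(stream):
    """In-line block-level de-dupe: store each unique block once and
    emit a pointer (an index into unique_blocks) for every occurrence."""
    seen = {}           # fingerprint -> address of first copy (the reference table)
    unique_blocks = []  # the physical back-up
    recipe = []         # pointers needed to reconstruct the original stream
    for i in range(0, len(stream), BLOCK_SIZE):
        block = stream[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).digest()
        if fp not in seen:                 # first instance: copy the block
            seen[fp] = len(unique_blocks)
            unique_blocks.append(block)
        recipe.append(seen[fp])            # every instance: record a pointer
    return unique_blocks, recipe

def restore(unique_blocks, recipe):
    # Recovery follows the pointers to re-form the original stream
    return b"".join(unique_blocks[p] for p in recipe)

data = b"A" * 8 * BLOCK_SIZE + b"B" * 4 * BLOCK_SIZE   # highly redundant input
blocks, recipe = dedupe_backup(data)
assert restore(blocks, recipe) == data
print(len(recipe), "blocks referenced,", len(blocks), "stored")
```

Note how `seen` grows with every unique block encountered: this is exactly the table whose growth slows the back-up or, held in memory as here, risks exhausting memory on large data sets.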
Some de-dupe hardware/software appliances have begun to appear to counter this performance overhead. They give a double performance benefit: first, because they process data off the main system and, second, because they use hardware to perform part of the process.
Another approach performs the de-dupe off-line, after the main back-up has completed. This means there is no overhead on live systems performance. However, while the end result is the same in terms of a much-reduced back-up size, the reduction is only temporary until the next full back-up occurs; space for a full back-up still has to be allocated. The advantage only really kicks in if this back-up is itself being further backed up.
Systems doing de-dupe, which is really a form of data compression, can perform more traditional data compression at the same time. This further reduces the output back-up size—but, of course, it further increases the back-up time as well as the time to re-form the information when recovering. (Again, this can be mitigated if a hardware-based de-dupe appliance is used.)
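The two techniques are complementary: de-dupe removes repeats between blocks, while conventional compression removes redundancy within each surviving block. A sketch of stacking them, using Python's standard zlib purely for illustration (the function and data are hypothetical):

```python
import hashlib
import zlib

def dedupe_and_compress(blocks):
    """De-dupe first, then compress each surviving unique block.
    Extra CPU on the way in (and on recovery), smaller output."""
    seen = set()
    stored = []
    for block in blocks:
        fp = hashlib.sha256(block).digest()
        if fp not in seen:               # keep only the first instance
            seen.add(fp)
            stored.append(zlib.compress(block))
    return stored

# Ten identical blocks plus one unique one:
raw = [b"log line: status OK\n" * 200] * 10 + [b"unique payload " * 200]
stored = dedupe_and_compress(raw)

original = sum(len(b) for b in raw)
final = sum(len(c) for c in stored)
print(f"{original} bytes in, {final} bytes out")
```

De-dupe alone cuts the eleven blocks to two; compression then shrinks what remains. Recovery must reverse both steps (decompress, then follow the pointers), which is the extra re-forming time the text mentions.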
There are also other popular technologies to consider in relation to de-dupe. For instance, continuous data protection (CDP), or more usually near-CDP, removes the need for a back-up window altogether, since every data change is recorded as it occurs. It also allows much faster recovery to the state immediately before a system failure. However, it does not sit well with in-line de-dupe techniques, which will undermine crucial CDP performance.
Then there is software that records all changes since the last back-up, which can be replayed to achieve an up-to-date picture after a restore. This works better with de-dupe than CDP does, provided the recorded transactions are applied only after the restore from the de-duped back-up has completed, so that all files have been re-established.
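The ordering constraint can be made concrete: the de-duped full back-up must be completely restored, with all pointers resolved, before the change log is replayed. A schematic sketch with hypothetical in-memory stand-ins for the restore and the log:

```python
def recover(restore_full_backup, change_log):
    """Step 1: re-establish every file from the de-duped back-up.
    Step 2: replay changes recorded since that back-up, in order,
    rolling the system forward to the point of failure."""
    files = restore_full_backup()
    for path, new_contents in change_log:
        files[path] = new_contents
    return files

# Illustrative use:
backup = {"a.txt": b"v1", "b.txt": b"v1"}           # state at last full back-up
log = [("a.txt", b"v2"), ("c.txt", b"new")]         # changes recorded since
state = recover(lambda: dict(backup), log)
print(sorted(state))  # ['a.txt', 'b.txt', 'c.txt']
```

Reversing the two steps would fail: a change to a file whose single physical instance had not yet been re-established from the de-duped back-up would have nothing to apply to.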
But if you are thinking primary storage should itself be de-duped, think again. Complications would arise as soon as one file was updated while its duplicates were not, since a separate physical instance of the changed file would immediately have to be established: a sudden, unwelcome performance hit. Moreover, any software accessing the data would need to recognise and unpack its de-duped form.
However, this does highlight a common-sense exercise: trying to physically eliminate all unnecessary file duplicates present on primary storage. This has the knock-on advantage of reducing the size of every back-up taken from it, even without deploying de-dupe itself.
More feasible, but needing care (and dependent on the systems accessing it), is de-duping data as it passes to second-tier ‘near-line’ storage. The reduction in space taken up by near-line data means, for instance, that data could be held there for longer before having to go to off-line archive, and so could be retrieved quickly for a longer period. This might suit data retained separately for rapid retrieval for compliance purposes.
Think through the benefits
The benefits from the huge space savings cannot be over-emphasised. Volumes of data are soaring, and increased pressure to retain data purely for compliance purposes is only exacerbating this, so the storage savings from de-duping are bound to keep increasing. Among the spin-off benefits: datacentres close to space or power overload buy themselves more time before an expensive major upgrade or relocation is needed; power and cooling running costs fall (also a ‘green’ dividend); and lower disk drive utilisation brings greater average longevity and/or lower repair costs.
Then, if the de-dupe back-up is being written out over a WAN to remote storage, there is a performance bonus from the greatly reduced amount of data being transmitted; the same is true for a recovery across the WAN after a breakdown. An equivalent bandwidth benefit applies when transmitting across in-house networks.
However, at this early stage of adoption, users should demand that vendors produce concrete evidence of de-dupe software/hardware resilience. They should also calculate the likely performance hit from the different options and consider whether it is acceptable for their organisation, since no two businesses’ performance needs are the same. The options actually available may also be narrowed by the way other installed software is being used. For instance, another vendor’s installed software that accesses ‘near-line’ storage would simply fall over if suddenly confronted with that data in de-duped form; this can only be resolved by inserting a routine that presents the data in the expected format, which would then carry its own performance overhead.
Finally, investigations of the de-dupe options may reveal further incompatibilities or performance overheads when mixing them with other software such as CDP or virtualisation software (or appliances) enhanced for thin provisioning. Even if this is not the case, the additional space-savings achievable through de-dupe may be less than expected if combined with other space-saving technologies.
Overall, the future for de-dupe looks good to me, but don’t get carried away with the hype. Think it through first.