De-duplication (de-dupe) specialist DataDomain today [12th May] launched its fastest—and the industry's fastest—de-dupe appliance, the DD690.
The stand-alone unit, which has an Intel quad-core CPU, can achieve up to 1.4TB/hour throughput, or up to 170MB/sec in a single stream for a large database. Its addressable physical capacity is 35.3TB. However, up to 16 DD690 arrays can be installed in the DDX cabinet, to provide an aggregate throughput of 22.4TB/hour and a maximum capacity of 28PB (presumably logical, pre-de-dupe capacity, since 16 x 35.3TB is only about 565TB of physical disk).
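The cabinet figures can be checked with simple arithmetic. The sketch below assumes linear scaling across 16 units and decimal units (1PB = 1000TB); the constant and function names are illustrative, and reading the 28PB figure as logical capacity is an inference from the numbers rather than something stated in the announcement.

```python
# Figures quoted in the article for a single DD690.
UNIT_THROUGHPUT_TB_PER_HOUR = 1.4
UNIT_PHYSICAL_CAPACITY_TB = 35.3
UNITS_PER_DDX = 16
QUOTED_DDX_CAPACITY_PB = 28


def ddx_totals(units=UNITS_PER_DDX):
    """Scale the single-unit figures linearly across a DDX cabinet."""
    throughput_tb_per_hour = units * UNIT_THROUGHPUT_TB_PER_HOUR
    physical_capacity_tb = units * UNIT_PHYSICAL_CAPACITY_TB
    return throughput_tb_per_hour, physical_capacity_tb


throughput, physical_tb = ddx_totals()

# Assumption: 1PB = 1000TB. The ratio is what the quoted 28PB would imply
# if it counts data before de-dupe and compression.
implied_ratio = QUOTED_DDX_CAPACITY_PB * 1000 / physical_tb

print(f"aggregate throughput: {throughput:.1f} TB/hour")
print(f"aggregate physical capacity: {physical_tb:.1f} TB")
print(f"implied logical-to-physical ratio: {implied_ratio:.1f}x")
```

The throughput matches the quoted 22.4TB/hour, while the physical capacity comes to roughly 565TB, so the 28PB figure would correspond to a reduction of close to 50x.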
What DataDomain describes as "the fastest single-stream de-dupe engine in the world" uses its own operating system, with an architecture designed to maximise processing and indexing within the CPU while minimising disk access. DataDomain's VP of product management Brian Biles told me that this approach would be difficult to match. "We expect a 50% de-dupe speed-up every time Intel doubles the number of CPUs."
The big positive in de-dupe is that it drastically reduces nearline and/or offline storage. (Primary storage is never de-duped.) In simple terms, all the data is examined as it is backed up and duplicate files and/or blocks (or even smaller segments) are replaced by a pointer tag to where the single instance of the data is held.
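The pointer idea can be illustrated with a minimal sketch. It assumes fixed-size blocks and SHA-256 fingerprints, which is a generic textbook approach and not DataDomain's actual engine; `DedupeStore`, `BLOCK_SIZE` and the sample data are all hypothetical.

```python
import hashlib

# Tiny block size so the example is easy to follow; real systems use much
# larger (and often variable-size) segments.
BLOCK_SIZE = 4


class DedupeStore:
    """Keeps one copy of each distinct block, keyed by its fingerprint."""

    def __init__(self):
        self.segments = {}  # fingerprint -> block bytes, stored once

    def backup(self, data):
        """Store data and return its 'recipe': a list of pointers (fingerprints)."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            # Only the first occurrence of a block is actually stored.
            self.segments.setdefault(fingerprint, block)
            recipe.append(fingerprint)
        return recipe

    def restore(self, recipe):
        """Rebuild the original bytes by following the pointers."""
        return b"".join(self.segments[fp] for fp in recipe)

    def stored_bytes(self):
        return sum(len(block) for block in self.segments.values())


store = DedupeStore()
data = b"ABCDABCDABCDWXYZ"  # 16 bytes, but only two distinct 4-byte blocks
recipe = store.backup(data)

print(len(data), "logical bytes ->", store.stored_bytes(), "stored bytes")
print("restored intact:", store.restore(recipe) == data)
```

Here 16 logical bytes shrink to 8 stored bytes plus a four-entry recipe; on real backups the saving depends on how much of the data repeats between runs.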
This picture becomes even more attractive (at least using the DataDomain architecture) if data is coming from a number of remote locations. A further announcement is the extension of its processing capability to allow up to 60 separate data streams to be de-duped before transmission over a WAN to, for instance, a single large receiving DDX. Clearly, the less physical data that needs to travel over the wire, the faster the transfer completes. Moreover, bringing separate data silos together in this way means a greater space saving (since, otherwise, identical data will remain replicated in the different silos) and is easier to manage centrally.
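The cross-site effect can be sketched as follows. This is an illustrative model only, assuming fixed-size segments and a shared fingerprint index at the receiving end; it ignores the small cost of exchanging fingerprints and is not a description of DataDomain's replication protocol. The names `replicate`, `SEGMENT_SIZE` and the sample site data are hypothetical.

```python
import hashlib

SEGMENT_SIZE = 8  # hypothetical segment size, in bytes


def split_segments(data):
    return [data[i:i + SEGMENT_SIZE] for i in range(0, len(data), SEGMENT_SIZE)]


def replicate(site_data, central_index):
    """Send only segments the central store lacks; return payload bytes sent.

    Fingerprint traffic is ignored here; only segment payloads are counted.
    """
    bytes_sent = 0
    for segment in split_segments(site_data):
        fingerprint = hashlib.sha256(segment).digest()
        if fingerprint not in central_index:
            central_index[fingerprint] = segment
            bytes_sent += len(segment)
    return bytes_sent


central = {}  # stands in for the receiving system's segment index
site_a = b"PAYROLL-" * 4 + b"SITE-A--"  # 40 bytes, mostly shared content
site_b = b"PAYROLL-" * 4 + b"SITE-B--"  # 40 bytes, mostly shared content

sent_a = replicate(site_a, central)
sent_b = replicate(site_b, central)

print("raw bytes:", len(site_a) + len(site_b))
print("bytes sent over the WAN:", sent_a + sent_b)
```

The second site sends only its one unique segment, because the shared content already arrived from the first site. That is the sense in which pooling silos saves more than de-duping each silo separately.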
Competitive approaches are, in general, not as efficient in space-saving; DataDomain expects its de-dupe to achieve a 10x reduction and, combined with data compression, a 20x reduction in output space, with a 95–99% reduction in cross-site transmission bandwidth. Then comes the question: "In that case why wouldn't everyone do this?"
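The relationship between an "Nx" reduction and a percentage saving is worth spelling out. The sketch below assumes the two ratios simply multiply, so the 2x compression factor is inferred from the quoted 10x and 20x figures rather than stated by DataDomain; the function names are illustrative.

```python
def percent_reduction(ratio):
    """Convert an 'Nx' reduction ratio into a percentage saving."""
    return (1 - 1 / ratio) * 100


def ratio_for_reduction(percent):
    """Inverse: the 'Nx' ratio needed to reach a given percentage saving."""
    return 1 / (1 - percent / 100)


dedupe_ratio = 10
compression_ratio = 2  # assumption: implied by 20x combined / 10x de-dupe
combined_ratio = dedupe_ratio * compression_ratio

print(f"{dedupe_ratio}x   -> {percent_reduction(dedupe_ratio):.0f}% saved")
print(f"{combined_ratio}x   -> {percent_reduction(combined_ratio):.0f}% saved")
print(f"99% saved -> about {ratio_for_reduction(99):.0f}x")
```

On this reading, 20x corresponds to the lower end of the quoted 95–99% bandwidth range, and the upper end would need a reduction of around 100x.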
There are several answers to this, and most of them revolve around where it is applied. DataDomain's phenomenal growth—doubling turnover in three quarters since it IPO'd last year and adding nearly 300 new customers in the past quarter alone—is a testament to the fact that most large organisations are now taking the plunge. (The company is now the de-dupe market leader in the US and a recent survey suggests it will soon overtake Symantec and EMC in Europe too.)
A second deterrent is that de-dupe performance can be a dog. That is why throughput figures are key, and why a stand-alone de-dupe appliance tends to be the best option; this is also part of why some vendors only provide de-duping on already backed-up data (as a background task that still slows down other processing and saves less space). In some systems performance can be improved but only at the expense of output disk space, to some extent defeating de-dupe's whole purpose.
Perhaps the biggest consideration of all is recovery from backup, and especially disaster recovery (DR). The de-duped data is not a mirror of the live system and, if recovery is needed, it has to be 'reconstituted' to its original form. So, if an immediate switchover to a remote DR site is needed, forget de-dupe, at least for the critical on-line data. But in all other cases, it comes down to how efficiently and reliably the recovery can be performed.
The output file will be the same in either case. Retrieving blocks using existing pointers will be faster than when they had to be created; but, of necessity, it will be slower than contiguous reading—unless segments are placed in optimum sequence and sometimes memory-held so no disk access is needed at all. DataDomain does try to arrange its stored segments optimally for retrieval performance.
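The effect of holding segments in memory during a restore can be sketched with a small least-recently-used cache. This is a generic illustration, not DataDomain's layout or caching scheme; the `SegmentReader` class, the dictionary standing in for the disk, and the sample recipe are all hypothetical.

```python
from collections import OrderedDict


class SegmentReader:
    """Reads segments by fingerprint, keeping recent ones in memory."""

    def __init__(self, disk, cache_size):
        self.disk = disk  # dict standing in for the on-disk segment store
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.disk_reads = 0

    def read(self, fingerprint):
        if fingerprint in self.cache:
            self.cache.move_to_end(fingerprint)  # mark as recently used
            return self.cache[fingerprint]
        self.disk_reads += 1  # cache miss: go to "disk"
        segment = self.disk[fingerprint]
        self.cache[fingerprint] = segment
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used
        return segment


def restore(recipe, reader):
    """Reconstitute a file by following its pointers in order."""
    return b"".join(reader.read(fp) for fp in recipe)


disk = {"a": b"HEAD", "b": b"BODY", "c": b"TAIL"}
recipe = ["a", "b", "b", "b", "c", "a"]

reader = SegmentReader(disk, cache_size=2)
output = restore(recipe, reader)

print(output)
print("pointers followed:", len(recipe), "disk reads:", reader.disk_reads)
```

With room for two segments the restore needs four disk reads for six pointers; a cache of three would cut that to three reads. How close a real system gets to contiguous-read speed depends on segment placement as well as caching.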
Potentially, then, recovery can be as quick as would be the case without de-dupe, largely removing that objection. The same applies if some de-duped data goes to off-line archive on tape and has to be reconstituted first. However, this performance level will not be achieved by all solutions, so performance testing vis-à-vis likely SLA requirements should be considered in all major evaluations.
The reliability and integrity of the data, and protection against hardware faults, are covered by such things as on-the-fly error detection and correction and continuous disk 'scrubbing' to remove errors before they become a problem. So, short of a DR regime where up-to-the-minute mirroring is crucial, I ask again: "why wouldn't you include de-dupe?" It is, anyway, very green (greatly reducing disk space and network bandwidth), so it saves energy and heat, which provides an obvious ROI.