DataDomain boosts de-dupe power and capacity, and answers critics

De-duplication
(de-dupe) specialist DataDomain today [12th May] launched its
fastest—and the industry’s fastest—de-dupe appliance, the DD690.

The stand-alone unit, which has an Intel quad-core CPU, can achieve up to
1.4TB/hour throughput, or up to 170MB/sec in a single stream for a large
database. Its
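addressable physical capacity is 35.3TB. However, up to 16 DD690 arrays can be
installed in the DDX cabinet, to provide an aggregate throughput of 22.4TB/hour
and a maximum capacity of 28PB. (Since 16 x 35.3TB is only about 565TB of
physical disk, the 28PB figure is presumably logical, post-de-dupe capacity.)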
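
As a quick check on the aggregate figures, the arithmetic can be sketched in a
few lines of Python. The per-unit numbers are the ones quoted above; the use of
decimal units (1TB = 1,000,000MB) is an assumption, and the variable names are
purely illustrative.

```python
# Per-unit figures as quoted in the article.
UNIT_THROUGHPUT_TB_PER_HOUR = 1.4
UNIT_PHYSICAL_CAPACITY_TB = 35.3
UNITS_PER_DDX = 16

# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000
SECONDS_PER_HOUR = 3600

# Scale the single-unit figures up to a fully populated cabinet.
aggregate_throughput = UNIT_THROUGHPUT_TB_PER_HOUR * UNITS_PER_DDX
aggregate_physical_tb = UNIT_PHYSICAL_CAPACITY_TB * UNITS_PER_DDX

# Convert the single-unit hourly throughput to MB/sec for comparison.
unit_mb_per_sec = UNIT_THROUGHPUT_TB_PER_HOUR * MB_PER_TB / SECONDS_PER_HOUR

print(f"Aggregate throughput: {aggregate_throughput:.1f} TB/hour")
print(f"Aggregate physical capacity: {aggregate_physical_tb:.1f} TB")
print(f"Single-unit throughput: {unit_mb_per_sec:.0f} MB/sec")
```

The single-unit figure works out at roughly 389MB/sec, which is presumably a
multi-stream total, since it sits well above the 170MB/sec quoted for one
stream.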

What DataDomain describes as “the fastest single-stream de-dupe engine in the
world” runs on the company’s own operating system, with an architecture designed
to maximise processing and indexing within the CPU while minimising disk access.
DataDomain’s VP of product management
Brian Biles told me that this approach would be difficult to match. “We expect
a 50% de-dupe speed-up every time Intel doubles the number of CPUs.”

The big positive
in de-dupe is that it drastically reduces nearline and/or offline storage.
(Primary storage is never de-duped.) In simple terms, all the data is examined
as it is backed up and duplicate files and/or blocks (or even smaller segments)
are replaced by a pointer tag to where the single instance of the data is held.
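
The idea described above can be illustrated with a minimal Python sketch. It is
not DataDomain's implementation: the fixed 4KB block size, the use of SHA-256 as
the block fingerprint, and the function names `dedupe` and `restore` are all
assumptions made for illustration (many products use variable-length segments
instead).

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size, chosen for illustration


def dedupe(data, store):
    """Split data into blocks, keep one copy of each unique block in `store`,
    and return the list of pointers (hashes) needed to rebuild the data."""
    pointers = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        # Store the block only the first time this content is seen.
        store.setdefault(digest, block)
        pointers.append(digest)
    return pointers


def restore(pointers, store):
    """Reconstitute the original data by following each pointer in order."""
    return b"".join(store[p] for p in pointers)


store = {}
# Three identical blocks followed by one different block.
backup = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
pointers = dedupe(backup, store)

print(len(pointers))  # 4 pointers, one per original block
print(len(store))     # 2 unique blocks actually stored
print(restore(pointers, store) == backup)  # True
```

Four blocks of input are held as two stored blocks plus four pointers, and
following the pointers rebuilds the original data exactly.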

This picture
becomes even more attractive—at least using the DataDomain architecture—if
data is coming from a number of remote locations. A further announcement is the
extension of its processing capability to allow up to 60 separate data streams
to be de-duped before transmission over a WAN to, for instance, a single large
receiving DDX. Clearly, the less physical data that needs to travel over the
wire, the faster this is completed. Moreover, bringing separate data silos
together in this way means a greater space saving—since, otherwise, identical
data will remain replicated in the different silos—and is easier to manage
centrally.
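
A rough sketch of why consolidating silos saves both bandwidth and space: if
each site sends only the blocks the central store does not already hold, data
common to several sites crosses the wire once. This is a simplified model, not
DataDomain's protocol; the block size, the site names and the helper
`bytes_to_send` are hypothetical, and the traffic needed to exchange
fingerprints is ignored.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size for illustration


def bytes_to_send(data, central_index):
    """Return the number of bytes a site must send over the WAN, adding the
    fingerprint of each newly seen block to the central index."""
    sent = 0
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        if digest not in central_index:
            central_index.add(digest)
            sent += len(block)
    return sent


# Made-up data: every site holds the same nine common blocks plus one
# block that is unique to that site.
common = b"".join(bytes([i]) * BLOCK_SIZE for i in range(9))
sites = {
    "site_a": common + b"a" * BLOCK_SIZE,
    "site_b": common + b"b" * BLOCK_SIZE,
    "site_c": common + b"c" * BLOCK_SIZE,
}

central_index = set()
raw_bytes = sum(len(d) for d in sites.values())
wan_bytes = sum(bytes_to_send(d, central_index) for d in sites.values())

print(raw_bytes // BLOCK_SIZE)  # 30 blocks held across the three sites
print(wan_bytes // BLOCK_SIZE)  # 12 blocks actually transmitted
```

In this toy case the first site sends all ten of its blocks and each later site
sends only its one unique block, so 30 blocks of source data become 12 on the
wire and in the central store.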

Competing approaches are, in general, not as efficient in space-saving;
DataDomain expects its de-dupe to achieve a 10x reduction and, combined with
data compression, a 20x reduction in output space, with a 95–99% reduction in
cross-site transmission bandwidth. Then comes the question: “In that case why
wouldn’t everyone do this?”
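
The relationship between an "Nx" reduction and a percentage saving is easy to
get wrong, so here is a small worked conversion in Python. The function names
are illustrative only. Note that a 10x de-dupe figure rising to 20x with
compression implies compression contributes roughly a further 2x.

```python
def percent_saved(factor):
    """Percentage of space or bandwidth saved by an N-fold reduction."""
    return (1 - 1 / factor) * 100


def reduction_factor(percent):
    """N-fold reduction implied by a given percentage saving."""
    return 1 / (1 - percent / 100)


print(f"{percent_saved(10):.0f}%")     # 10x reduction -> 90% saved
print(f"{percent_saved(20):.0f}%")     # 20x reduction -> 95% saved
print(f"{reduction_factor(95):.0f}x")  # 95% saved -> 20x reduction
print(f"{reduction_factor(99):.0f}x")  # 99% saved -> 100x reduction
```

So the 20x space figure lines up with the bottom of the quoted 95–99% bandwidth
range, while the top of that range would correspond to a 100x reduction.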

There are several
answers to this, and most of them revolve around where it is applied. DataDomain’s phenomenal growth—doubling
turnover in three quarters since it IPO’d last year and adding nearly 300 new
customers in the past quarter alone—is a testament to the fact that most
large organisations are now taking the plunge. (The company is now the de-dupe
market leader in the US, and a recent survey suggests it will soon overtake
Symantec and EMC in Europe too.)

A second deterrent
is that de-dupe performance can be a dog. That is why throughput figures are
key, and why a stand-alone de-dupe appliance tends to be the best option; this
is also part of why some vendors provide de-duping only on already
backed-up data (as a background task that still slows down other processing and
saves less space). In some systems performance can be improved but only at the
expense of output disk space, to some extent defeating de-dupe’s whole purpose.

Perhaps the biggest deterrent of all is concern about recovery from backup, and
especially disaster recovery (DR). The de-duped data is not a mirror of the
live system and, if recovery is needed, it has to be ‘reconstituted’ to its
original form. So, if an immediate switchover to a remote DR site is needed,
forget de-dupe, at least for the critical on-line data. But in all other cases,
it comes down to how efficiently and reliably the recovery can be performed.

The recovered file will be the same with or without de-dupe. Retrieving blocks
using existing pointers will be faster than the original de-dupe pass, when
those pointers had to be created; but, of necessity, it will be slower than
contiguous reading, unless segments are placed in optimum sequence and
sometimes held in memory so that no disk access is needed at all. DataDomain
does try to arrange its stored segments optimally for retrieval performance.

Potentially, then, recovery can be as quick as it would be without de-dupe,
largely removing that objection. The same applies if some de-duped data goes to
off-line archive on tape and has to be reconstituted first. However, this
performance level will not be achieved by all solutions, so performance testing
against likely SLA requirements should be part of every major evaluation.

The reliability and integrity of the data, and protection against hardware
faults, are covered by such things as on-the-fly error detection and correction
and continuous disk ‘scrubbing’ to remove errors before they become a problem.
So, short of a DR regime where up-to-the-minute mirroring is crucial, I ask
again: “why wouldn’t you include de-dupe?” It is, anyway, very green: greatly
reducing disk space and network bandwidth saves energy and heat, which provides
an obvious ROI.