SEPATON shows how partial post-process de-dupe can score over in-line

I gained a more
positive view of post-process de-dupe—or rather what I would call ‘partial post-process’—from meeting
with virtual tape library (VTL) appliance provider SEPATON last week. Its new DeltaStor
de-dupe approach is unique and so deserves a separate review.

De-duplication performed
during the initial backup, so-called ‘in-line’ de-dupe, is typically carried out
by an appliance, transparently to any management process, and (with compression
included) can achieve a space saving of 20x or more over a standard backup. Applied
across the board to every file and system, it typically treats them all just as
blocks of data without taking account of file type or content. ‘Post-process’ de-dupe, which
is applied to a backup only after it
has been created, initially requires extra
space and typically incurs some management overhead; this is not so smart in my
book.

SEPATON’s
DeltaStor is technically ‘post-process’ but with a difference. Its software examines
the backup copy of each individual file and database (‘object’) in turn but,
uniquely, uses its ContentAware stored intelligence to recognise the output of all the leading
backup and archive vendors, each of which embeds its own markers in the data. In SEPATON’s
de-dupe process these markers are extracted before the data is processed.

There then follows
a byte-level examination of the whole data stream; from this the de-dupe
process (which does not use hashing) creates variable-length output representing anything from 128 bytes to the whole
object. “Nobody else does that,” said Miklos Sandorfi, SEPATON’s CTO, who
pointed to a verified 48x space-saving typically being achieved in its VTL
output. It still needs additional space but, as Sandorfi explained, far less
than you might expect…
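
To make the idea of byte-level, variable-length matching more concrete, here is a minimal sketch in Python. It is emphatically not SEPATON’s algorithm, which was not described to me in that detail; it simply uses the standard library’s difflib to compare two byte streams directly (no chunk fingerprints) and to replace any common run of at least 128 bytes with a reference. The function names and the operation format are my own assumptions for illustration.

```python
from difflib import SequenceMatcher

MIN_MATCH = 128  # shortest run worth referencing, per the 128-byte figure above


def delta_encode(reference: bytes, new: bytes, min_match: int = MIN_MATCH):
    """Describe `new` as copies from `reference` plus literal bytes.

    Hypothetical op format: ("copy", offset, length) or ("literal", data).
    """
    matcher = SequenceMatcher(None, reference, new, autojunk=False)
    ops = []
    pos = 0  # number of bytes of `new` already described
    for block in matcher.get_matching_blocks():
        if block.size < min_match:
            continue  # too short to reference (also skips the final empty block)
        if block.b > pos:
            ops.append(("literal", new[pos:block.b]))
        ops.append(("copy", block.a, block.size))
        pos = block.b + block.size
    if pos < len(new):
        ops.append(("literal", new[pos:]))
    return ops


def delta_decode(reference: bytes, ops) -> bytes:
    """Rebuild the original stream from the reference and the op list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += reference[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```

Because the matched runs can be any length from the minimum up to the whole object, a backup that is identical to its predecessor collapses to a single copy operation, while a small edit leaves only the changed bytes as literals. difflib is far too slow for real backup volumes; it is used here only to keep the sketch short.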

Since each
backed-up file or database is handled as a separate entity, DeltaStor can be
set to start de-duping the first file as soon as that backup is
complete, concurrently with the next file backup, and so on (so it is effectively
only ‘partial post-process’). This
also means the minimum amount of space that has to be pre-allocated is the
size of the largest file to be backed up plus
the total de-dupe output space (which all de-dupe products need). Remember too
that, on the figures quoted above, DeltaStor’s de-dupe output should come out at less than half that of the
best in-line de-dupe products. Some files should not be de-duped (for instance
already-encrypted ones); with DeltaStor’s approach the decision whether to
de-dupe can be made at the most granular, single-file level to further assist
space-saving.
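
The space arithmetic in the paragraph above can be sketched in a few lines of Python. This is only a back-of-envelope model under my own assumptions: it takes the 20x (in-line) and 48x (DeltaStor) ratios at face value, treats them as applying uniformly to all the data, and ignores replication, retention and files excluded from de-dupe. The function names and backup sizes are hypothetical.

```python
def partial_post_process_space(backup_sizes, ratio):
    """Landing space for the largest single backup plus the de-duped output."""
    return max(backup_sizes) + sum(backup_sizes) / ratio


def inline_space(backup_sizes, ratio):
    """In-line de-dupe needs only the de-duped output."""
    return sum(backup_sizes) / ratio


def break_even_total(largest, inline_ratio, post_ratio):
    """Total data volume at which the two approaches need equal space.

    Solves largest + T / post_ratio == T / inline_ratio for T.
    """
    return largest / (1 / inline_ratio - 1 / post_ratio)


if __name__ == "__main__":
    sizes_tb = [10] * 96  # hypothetical: 96 backups of 10 TB each
    print(partial_post_process_space(sizes_tb, 48))
    print(inline_space(sizes_tb, 20))
    print(break_even_total(10, 20, 48))
```

On these assumptions the partial post-process approach only pulls ahead once the total data backed up exceeds roughly 34 times the size of the largest single backup; below that, the landing space for the largest file outweighs the better ratio.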

So when calculating
the overall space saving versus the
best in-line solutions, consider: a) the total
amount of data to be backed up (typically more for larger enterprises), b) the
degree to which further replication is
to be applied to the de-duped backups (since with SEPATON these instances will
be smaller, which also helps performance especially if some travel over a WAN),
c) the effect of some files not being de-duped, and d) how long the data is to be
stored accessibly from disk (since the longer it is retained in this near-line de-duped
state the bigger the space-saving).

Logically, SEPATON’s
approach is most attractive to larger enterprises with larger and more complex backup
and archiving needs that do not mind a small amount of extra management. In
exchange, SEPATON offers some enterprise-level additions.

For instance, there
is a rigorous byte-comparison check on data integrity. Sandorfi says that SATA
disks have a habit of silently changing data without reporting an error. (Very nasty
if true!) Also, SEPATON’s ‘forward differencing’ approach reverses the way most
de-dupe products work. Whereas they keep the first instance of data as a reference copy and
replace subsequent instances with pointers, DeltaStor stores the most recent
copy in full and replaces old, redundant data with pointers. This circumvents
two problems: a) a gradual tail-off in backup performance and b) a delay in restoring
the most up-to-date data.
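
A toy model may help show why keeping the newest copy in full speeds up the most common restore. The Python sketch below is my own illustration of the principle, not SEPATON’s implementation: the class name, the delta format and the use of difflib are all assumptions. Each time a new backup arrives, the previous one is rewritten as a delta against it, so the latest version never needs reconstructing.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Ops that rebuild `target` from `base`: ("copy", start, end) or ("data", bytes)."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": carry the target's own bytes
            ops.append(("data", target[j1:j2]))
        # "delete" contributes nothing to the target
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        out += base[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


class NewestFullStore:
    """Hypothetical store: newest backup in full, older ones as deltas."""

    def __init__(self):
        self.latest = None  # most recent backup, stored whole
        self.deltas = []    # deltas[k] rebuilds version k from version k + 1

    def add(self, data: bytes):
        if self.latest is not None:
            # Demote the previous backup to a delta against the new one.
            self.deltas.append(make_delta(data, self.latest))
        self.latest = data

    def restore(self, version: int) -> bytes:
        """Version 0 is the oldest; len(self.deltas) is the newest."""
        if self.latest is None or not 0 <= version <= len(self.deltas):
            raise IndexError("no such version")
        data = self.latest
        for k in range(len(self.deltas) - 1, version - 1, -1):
            data = apply_delta(data, self.deltas[k])
        return data
```

Restoring the newest version is a straight read, while older versions cost one delta application per step back in time. A scheme that keeps the first copy as the reference has the opposite profile: the newest data is the most expensive to rebuild.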

In-line solutions
that cannot maintain wire-speed will impede initial backup throughput
performance. Although it is not a like-for-like comparison, SEPATON’s VTL appliance does boast up
to 34.5TB/hour as well as scalability to 1.6 petabytes of data.

Finally, through its
software being aware of the content, SEPATON is working to develop other
functionality, for instance to facilitate secure, audited content searches for
legal discovery. (But that is for another day.)

Right now the
choice between in-line and SEPATON-style partial
post-process depends on organisation size and needs. But I still await a
convincing argument for standard post-process de-dupe.