I gained a more positive view of post-process de-dupe—or rather what I would call ‘partial post-process'—from meeting with virtual tape library (VTL) appliance provider SEPATON last week. Its new DeltaStor de-dupe approach is unique and so deserves a separate review.
De-duplication performed during an initial backup (so-called ‘in-line’ de-dupe) is typically achieved transparently to any management process by an appliance and (with compression included) can achieve a space saving of 20x or more over a standard backup. Applied across the board to every file and system, it typically treats them all just as blocks of data without taking account of file type or content. ‘Post process’ de-dupe, which is applied to a backup only after it is created, initially requires extra space and typically incurs some management overhead; this is not so smart in my book.
SEPATON's DeltaStor is technically ‘post process’ but different. Its software examines the backup copy of each individual file and database (‘object’) in turn but, uniquely, uses its ContentAware stored intelligence to recognise the backup and archive output of all the leading vendors, whose products embed their own markers in the data. In SEPATON's de-dupe process these markers are extracted before the data is processed.
There then follows a byte-level examination of the whole data stream; from this the de-dupe process (which does not use hashing) creates variable-length output representing anything from 128 bytes to the whole object. "Nobody else does that," said Miklos Sandorfi, SEPATON's CTO, who pointed to a verified space saving of 48x typically being achieved in its VTL output. DeltaStor still needs additional space but, as Sandorfi explained, far less than you might expect...
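To make the idea of hash-free, byte-level differencing more concrete, here is a minimal sketch in Python. It is not SEPATON's algorithm, which is not described in any detail here; it simply compares a new object with a reference copy byte by byte using the standard library's `difflib.SequenceMatcher` and emits variable-length ‘copy’ and ‘literal’ operations. The names `byte_delta`, `apply_delta` and `literal_bytes` are hypothetical, and treating 128 bytes as the shortest run worth replacing with a pointer is my own assumption about how that figure might apply.

```python
from difflib import SequenceMatcher


def byte_delta(reference: bytes, new: bytes, min_match: int = 128):
    """Describe `new` as variable-length operations against `reference`.

    ("copy", offset, length) points at bytes already held in `reference`;
    ("literal", data) carries bytes that must be stored afresh.
    Matching runs shorter than `min_match` are kept as literals
    (an assumption made for illustration only).
    """
    ops = []
    # Direct byte comparison: no chunk hashes or fingerprints are computed here.
    matcher = SequenceMatcher(None, reference, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" and (i2 - i1) >= min_match:
            ops.append(("copy", i1, i2 - i1))
        elif j2 > j1:
            # Changed, inserted, or too-short matching bytes are stored as-is.
            ops.append(("literal", new[j1:j2]))
    return ops


def apply_delta(reference: bytes, ops) -> bytes:
    """Rebuild the new object from the reference copy plus the operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += reference[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(ops) -> int:
    """Number of bytes that still have to be stored in full."""
    return sum(len(op[1]) for op in ops if op[0] == "literal")
```

Calling `byte_delta(old_backup, new_backup)` on two near-identical objects should leave only the changed regions as literals, with everything else reduced to pointers. `SequenceMatcher` is far too slow for real backup volumes; it is used here only because it keeps the sketch short and dependency-free.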
Since each backed-up file or database is handled as a separate entity, DeltaStor can be set to start de-duping the first file as soon as its backup is complete, concurrently with the backup of the next file, and so on (so it is effectively only ‘partial post-process’). This also means the minimum amount of output space that has to be pre-allocated is the size of the largest file to be backed up plus the total de-dupe output space (which all de-dupe products need). Remember too that DeltaStor's de-dupe space will come out at less than half that used by the best in-line de-dupe products. Some files should not be de-duped (for instance, already encrypted ones); with DeltaStor's approach the decision whether to de-dupe can be set at the most granular, single-file level to further assist space saving.
So when calculating the overall space saving versus the best in-line solutions, consider: a) the total amount of data to be backed up (typically more for larger enterprises), b) the degree to which further replication is to be applied to the de-duped backups (since with SEPATON these instances will be smaller, which also helps performance especially if some travel over a WAN), c) the effect of some files not being de-duped, and d) how long the data is to be stored accessibly from disk (since the longer it is retained in this near-line de-duped state the bigger the space-saving).
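The trade-off between the extra landing space and the higher de-dupe ratio can be roughed out with some simple arithmetic. The sketch below is a deliberately crude model, not a sizing tool: it assumes a constant ratio per retained backup, takes the 20x and 48x figures quoted above as illustrative inputs, and ignores factors b) and c). All function names and the example volumes are hypothetical.

```python
def inline_space_tb(total_tb: float, retained_backups: int,
                    ratio: float = 20.0) -> float:
    """Disk needed by an in-line product: de-duped output only."""
    return total_tb * retained_backups / ratio


def partial_post_process_space_tb(total_tb: float, largest_file_tb: float,
                                  retained_backups: int,
                                  ratio: float = 48.0) -> float:
    """Disk needed when each file lands in full before being de-duped.

    Landing space is assumed to equal the largest single file, since each
    file is de-duped as soon as its own backup completes.
    """
    return largest_file_tb + total_tb * retained_backups / ratio


def break_even_backups(total_tb: float, largest_file_tb: float,
                       inline_ratio: float = 20.0,
                       post_ratio: float = 48.0) -> float:
    """Retained backups at which the two approaches need equal space.

    Solves largest + n*T/post_ratio == n*T/inline_ratio for n.
    """
    gain_per_backup = total_tb * (1.0 / inline_ratio - 1.0 / post_ratio)
    return largest_file_tb / gain_per_backup
```

With a hypothetical 100TB per backup and a largest file of 5TB, `break_even_backups(100, 5)` comes out at under two retained backups under these assumptions; the longer the retention, the further the gap widens, which is the point of factor d).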
Logically, SEPATON's approach is most attractive to larger enterprises with larger and more complex backup and archiving needs who do not mind a minimal amount of extra management. In exchange SEPATON offers some enterprise-level additions.
For instance, there is a rigorous byte comparison check on data integrity. Sandorfi says that SATA disks have a habit of changing the data without showing an error. (Very nasty if true!) Also, SEPATON's ‘forward differencing’ approach reverses the way most de-dupes work. Whereas they use the first instance of data as a reference copy and replace subsequent instances with a pointer, DeltaStor stores the most recent copy of the data in full and replaces old, redundant data with pointers. This circumvents two problems: a) a gradual tail-off in backup performance and b) a delay in restoring the most up-to-date data.
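The difference between the two directions is easier to see in a toy example. The Python sketch below keeps the newest backup in full and demotes each previous backup to a set of pointers (plus any bytes that differ) against its successor, so restoring the latest copy needs no reconstruction at all. This is only my illustration of the principle as described above, not SEPATON's implementation; the class `NewestFullStore` and its helpers are hypothetical names, and the differencing is done with the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Operations that rebuild `target` from `base`."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))        # pointer into base
        elif j2 > j1:
            ops.append(("literal", target[j1:j2]))   # bytes unique to target
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)


class NewestFullStore:
    """Keeps the most recent backup whole; older ones become deltas."""

    def __init__(self):
        self.latest = None   # newest backup, stored in full
        self.older = []      # older[k] rebuilds version k from version k + 1

    def add_backup(self, data: bytes) -> None:
        if self.latest is not None:
            # The previous newest copy is replaced by pointers into the new one.
            self.older.append(make_delta(data, self.latest))
        self.latest = data

    def restore(self, version: int) -> bytes:
        """Version 0 is the oldest backup; len(self.older) is the newest."""
        if self.latest is None or not 0 <= version <= len(self.older):
            raise IndexError("no such backup version")
        data = self.latest   # the newest copy needs no reconstruction
        for k in range(len(self.older) - 1, version - 1, -1):
            data = apply_delta(data, self.older[k])
        return data
```

In this arrangement the cost of reconstruction falls on the oldest backups, which are usually the least likely to be restored in a hurry.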
In-line solutions that cannot maintain wire-speed will impede initial backup throughput. Although this is not a like-for-like comparison, SEPATON's VTL appliance does boast up to 34.5TB/hour as well as scalability to 1.6 petabytes of data.
Finally, through its software being aware of the content, SEPATON is working to develop other functionality, for instance to facilitate secure, audited content searches for legal discovery. (But that is for another day.)
Right now the choice between in-line and SEPATON-style partial post-process depends on organisation size and needs. But I still await a convincing argument for standard post-process de-dupe.