Njini-ous ways to reduce unstructured data storage


If every single piece of data a business accumulates had real value now or in the future, then most businesses would find the budget and the space to keep it all. But that could only be achieved if everything of no value could be jettisoned straight away.

Life’s not like that. The vast majority of stored information will never be used, but the problem everyone has is identifying what can be safely discarded or moved to low-cost archive versus what is of real value or needed for compliance purposes. Most data nowadays is unstructured and the number of disparate file types has multiplied in recent years.

“If you don’t know what it is, you have to store it,” goes the argument, so most companies do just that, not knowing how to tackle head-on the underlying cause of the problem. Only a few genuine solutions to this are on the horizon at present. They usually involve software creating metadata about the information in files as they are created or received by the business; this metadata is then updated as needed whenever the information itself changes, until both can finally be discarded.
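As a rough illustration of the kind of record such software might keep, the Python sketch below builds a per-file metadata record as a file is received and refreshes it when the file changes. The field names (owner, content_hash, compliance_hold and so on) are my own assumptions for illustration; they do not describe Njini’s actual metadata format.

    from dataclasses import dataclass
    from datetime import datetime, timezone
    from pathlib import Path
    import hashlib


    @dataclass
    class FileMetadata:
        """Hypothetical per-file metadata record; field names are illustrative only."""
        path: str
        owner: str
        file_type: str
        size_bytes: int
        content_hash: str
        created_at: datetime
        last_modified: datetime
        compliance_hold: bool = False  # e.g. must be retained for regulatory reasons


    def build_metadata(path: Path, owner: str, compliance_hold: bool = False) -> FileMetadata:
        """Create a metadata record as a file is received into the storage pool."""
        data = path.read_bytes()
        now = datetime.now(timezone.utc)
        return FileMetadata(
            path=str(path),
            owner=owner,
            file_type=path.suffix.lstrip(".").lower(),
            size_bytes=len(data),
            content_hash=hashlib.sha256(data).hexdigest(),
            created_at=now,
            last_modified=now,
            compliance_hold=compliance_hold,
        )


    def refresh_metadata(meta: FileMetadata) -> FileMetadata:
        """Update the record whenever the underlying file changes."""
        data = Path(meta.path).read_bytes()
        meta.size_bytes = len(data)
        meta.content_hash = hashlib.sha256(data).hexdigest()
        meta.last_modified = datetime.now(timezone.utc)
        return meta


    # Example: create a small file and build its metadata record.
    p = Path("example.txt")
    p.write_text("draft contract")
    meta = build_metadata(p, owner="alice")
    print(meta.file_type, meta.size_bytes)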

Once metadata is present, a set of company policies regarding stored information can be implemented automatically in software rather than, at best, applied piecemeal in costly processes involving some manual intervention. A policy engine examines this metadata and acts on it according to the policies the business has provided, deciding in real time which storage tier each file should be placed in or moved to.
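A policy engine of this sort can be sketched very simply: each policy pairs a predicate over the metadata with a target tier, and the first matching policy wins. The following is a minimal illustration under that assumption, not a description of Njini’s actual engine; the tier names and example policies are invented.

    from typing import Callable, Dict, List, Tuple

    # A policy is a (name, predicate, target_tier) triple; the predicate inspects
    # a file's metadata (here just a plain dict) and returns True if it applies.
    Policy = Tuple[str, Callable[[Dict], bool], str]


    def choose_tier(metadata: Dict, policies: List[Policy],
                    default_tier: str = "tier1-primary") -> str:
        """Evaluate policies in order and return the storage tier for this file."""
        for name, predicate, tier in policies:
            if predicate(metadata):
                return tier
        return default_tier


    # Illustrative policies only; a real deployment would express these in whatever
    # policy language the product provides.
    policies: List[Policy] = [
        ("compliance-hold", lambda m: m.get("compliance_hold", False), "tier1-primary"),
        ("large-media-files", lambda m: m.get("file_type") in {"mp3", "avi"}, "tier2-nearline"),
    ]

    print(choose_tier({"file_type": "mp3", "compliance_hold": False}, policies))  # tier2-nearline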

So far there are few players who properly address this issue—and it will be a few years before a standard metadata format is established. But one company deserving an honourable mention at this time is UK-based Njini. Its software resides in-band within a NAS or SAN, creating its form of metadata in real time as files are received, then steers these files to the correct storage tier based on previously established policies.

The degree of success in doing this obviously depends on an organisation having a sufficient number of good policies covering a high enough percentage of the total storage to make a noticeable difference. By having something of a handle on what the information in a file contains, the software may also help in, for instance, automating decisions about moving data that has not changed for over 90 days. Is there a good reason, based on business value or a compliance need, why it has to stay in a high-performance, high-cost storage tier? If not, the policy can trigger its automatic migration to a lower-cost archive. This, I think, is true information lifecycle management (ILM).
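Expressed as code, that 90-day rule might look something like the sketch below. The threshold and the compliance and business-value flags follow the example in the text; everything else is an assumption for illustration.

    from datetime import datetime, timedelta, timezone


    def should_archive(last_modified: datetime, compliance_hold: bool,
                       business_value: bool, max_idle_days: int = 90) -> bool:
        """Return True if an unchanged file should move to a lower-cost archive tier.

        A file is a candidate when it has not changed for more than max_idle_days
        and there is neither a compliance need nor a business-value reason to keep
        it on high-performance storage.
        """
        idle = datetime.now(timezone.utc) - last_modified
        if idle <= timedelta(days=max_idle_days):
            return False
        return not (compliance_hold or business_value)


    # Example: a file untouched for 120 days with no holds becomes an archive candidate.
    stale = datetime.now(timezone.utc) - timedelta(days=120)
    print(should_archive(stale, compliance_hold=False, business_value=False))  # True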

Njini’s software has some less obvious features that add to its effectiveness. E-mail archiving saves large amounts of space by holding just one copy of a message and giving every recipient a link to it. Other unstructured files shared by multiple users within the same storage pool cannot normally be handled this way, because any one of those users might update their copy (though this is the exception), so each copy is held separately. Yet, if there is a policy for it, Njini’s software can detect duplicate files in real time and de-duplicate them transparently to their users. This is possible only because it will also detect any change subsequently made to one of the copies and immediately re-duplicate the file when needed; Njini points out that out-of-band software could not do this.
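In outline, this kind of in-band de-duplication amounts to detecting identical content (for instance by hashing it), keeping a single physical copy with references to it, and breaking the sharing again the moment one user writes to their copy. The sketch below illustrates only that idea; it says nothing about how Njini’s software is actually built.

    import hashlib
    from typing import Dict

    # Physical store: content hash -> file bytes. Logical view: user path -> content hash.
    physical_store: Dict[str, bytes] = {}
    logical_files: Dict[str, str] = {}


    def store_file(path: str, data: bytes) -> None:
        """De-duplicate on write: identical content is stored once and shared."""
        digest = hashlib.sha256(data).hexdigest()
        physical_store.setdefault(digest, data)
        logical_files[path] = digest


    def update_file(path: str, new_data: bytes) -> None:
        """Re-duplicate on change: the modified copy gets its own physical content,
        while other users' logical files keep pointing at the original."""
        store_file(path, new_data)


    # Two users save the same attachment: one physical copy, two logical files.
    store_file("/alice/report.doc", b"quarterly figures")
    store_file("/bob/report.doc", b"quarterly figures")
    assert len(physical_store) == 1

    # Bob edits his copy: it is transparently re-duplicated; Alice's is unchanged.
    update_file("/bob/report.doc", b"quarterly figures, revised")
    assert len(physical_store) == 2
    assert physical_store[logical_files["/alice/report.doc"]] == b"quarterly figures"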

Are these things a big deal—especially for a major enterprise? Well, if we go with Njini’s own estimates, over 80% of a company’s unstructured file data has not been accessed for over 90 days, and unstructured file duplication of the kind just described may account for 40–60% of the information stored within a shared storage area. If these percentages are anywhere near the mark (and how much of the benefit is realised depends heavily on how far the policies set up exploit them), this could indeed be a big deal for some.
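To put rough numbers on those estimates, the short calculation below applies them to a hypothetical 100 TB shared pool. The pool size is my assumption, and the two figures are kept separate because tiering stale data and collapsing duplicates address different costs.

    pool_tb = 100.0                   # hypothetical shared storage pool size
    cold_fraction = 0.80              # Njini's estimate: data untouched for over 90 days
    dup_low, dup_high = 0.40, 0.60    # Njini's estimate of duplicated unstructured data

    movable_to_archive_tb = pool_tb * cold_fraction
    reclaimable_low_tb = pool_tb * dup_low
    reclaimable_high_tb = pool_tb * dup_high

    print(f"Candidate for a lower-cost tier: {movable_to_archive_tb:.0f} TB")
    print(f"Duplicate data that could be collapsed: {reclaimable_low_tb:.0f}-{reclaimable_high_tb:.0f} TB")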