IBM Optim and Data De-Identification


Written By: Nigel Stanley
Published: 7th February, 2008
Content Copyright © 2008 Bloor. All Rights Reserved.

For many, IT security has been a long journey that is starting to come to an end as perimeters (at least those that exist) are secured, data is encrypted, viruses killed and leaks plugged using the latest vendor offerings.

Unfortunately the journey is far from over, as those who undertake systems implementation and development are finding out.

The issue? Data de-identification.

For those unfamiliar with the term some explanation is necessary.

Imagine that you run the software development function for a bank. You employ teams of developers who cut code all day, creating bespoke applications for the various end-user departments: maybe trading solutions, maybe back-office process solutions.

How are you going to test the software?

Easy: simply take a cut of data from the production database, populate the development server with it and start running tests. This practice would be familiar to development shops all over the world and happens all the time. The inherent problem it poses, from a security viewpoint, is that the cut of production data now sitting on the development server is a full and frank copy of some probably very sensitive information. Because the data has been extracted from the production server and now sits in software development, it almost inevitably resides below the radar of the corporate security team, and therefore represents a potentially huge data leak waiting to be exploited.

For some this practice may seem a bit far-fetched, but I would suggest that for the majority of development shops, whether internal departments or external consultancies, this is exactly what happens day in, day out. By the nature of software development, security is often seen as sitting way down the list of concerns. As long as the daily build does just that, and the latest code is put into a fireproof safe overnight, few give a thought to the nature of the test data.

Some smarter developers see a way around this problem and decide to create their own test data using clever algorithms that churn out random data sets for customer names or credit card details. Indeed, quite a few database administrators have become adept at writing SQL code to generate test data, perhaps using a vendor's sample database for seed values. The problem with this approach is that it is very one-dimensional. How can you be certain that you have quality of data as well as volume? Creating volumes of data is easy; creating meaningful data that actually looks and behaves as your production data would is tough.
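To make that one-dimensionality concrete, here is a minimal sketch of the DIY approach, written in Python rather than SQL and using invented column names. The output has the right shape, but every field is drawn independently, so nothing correlates the way production data does:

```python
import random

# Tiny seed lists standing in for a vendor's sample database.
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave"]
LAST_NAMES = ["Smith", "Jones", "Patel", "Khan"]

def random_customer(customer_id: int) -> dict:
    """Generate one row of purely random customer test data."""
    return {
        "customer_id": customer_id,
        "first_name": random.choice(FIRST_NAMES),
        "last_name": random.choice(LAST_NAMES),
        # The right shape for a card number, but it will fail a Luhn
        # check and bears no relation to real issuer number ranges.
        "card_number": "".join(random.choice("0123456789") for _ in range(16)),
    }

rows = [random_customer(i) for i in range(1, 1001)]
print(rows[0])
```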

Realistic data not only creates a system that looks and works as it should, it also helps engage the end-user customer, who gets to see the type of data they process daily.

Another difficulty with the DIY approach is how to create meaningful data that stretches across a relational database structure, preserving referential integrity and ensuring that the correct data types sit in the correct columns. In fact, once you start to survey the complexities of the problem, populating a database of more than a few tables with meaningful, realistic data that actually works begins to look like a very ugly herd of elephants storming over the horizon.
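To see why referential integrity is the hard part, consider what happens if a customer key is scrambled differently in two tables: every join between them breaks. One common remedy, sketched below with made-up table and column names and not representing any particular product's method, is to derive the masked key deterministically, so the same original value always maps to the same pseudonym wherever it appears:

```python
import hashlib
import hmac

SECRET = b"masking-key"  # would be kept away from the development team

def mask_key(value: str) -> str:
    """Map an original key to a stable, non-reversible pseudonym."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:12]

customers = [{"customer_id": "C1001", "name": "Jane Doe"}]
orders = [{"order_id": "O1", "customer_id": "C1001", "amount": 250.0}]

masked_customers = [{**r, "customer_id": mask_key(r["customer_id"])} for r in customers]
masked_orders = [{**r, "customer_id": mask_key(r["customer_id"])} for r in orders]

# The foreign key still joins: both tables now hold the same pseudonym.
assert masked_orders[0]["customer_id"] == masked_customers[0]["customer_id"]
```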

Aside from internally developed applications we also need to consider the implementation of packaged solutions such as JDE, SAP and Siebel. These are complicated implementations by anyone's reckoning and demand to be taken very seriously when it comes to testing and deployment. For many, the only way to undertake proper testing is to use production data and hope that it remains secure inside the development team.

The security of the development team itself also needs consideration. The fashion for offshoring may or may not be waning, but the reality is that many corporate applications are developed using overseas resources based in countries that the development managers commissioning the work have never even visited. How are these people supposed to test their code? Fine, it could be sent back for testing, but what if you don't have the resources available? For many companies the result is data being sent overseas and ending up completely outside the control of the original security team.

This is where data de-identification comes into play.

This is the process of masking original data by scrambling the source information so that it becomes useless as a data set while still retaining the look, feel and consistency of the original. The resulting obfuscation might be the random swapping of first and last names, the random substitution of certain credit card digits, or artificial data aging. The data remains extremely useful to system testers and implementers, but is of no value to anyone tempted to run off with it as part of a data theft scheme.
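As a rough sketch of those three techniques, again in Python with assumed field names and not representing any vendor's implementation:

```python
import random
from datetime import date, timedelta

def shuffle_surnames(rows: list) -> None:
    """Randomly reassign surnames across rows, breaking real pairings."""
    surnames = [r["last_name"] for r in rows]
    random.shuffle(surnames)
    for row, surname in zip(rows, surnames):
        row["last_name"] = surname

def mask_card(card: str) -> str:
    """Substitute the middle digits, keeping the issuer prefix and last four.

    A production tool would also recompute the Luhn check digit so the
    result still validates; that step is omitted here for brevity.
    """
    middle = "".join(random.choice("0123456789") for _ in range(len(card) - 10))
    return card[:6] + middle + card[-4:]

def age_date(d: date, max_days: int = 365) -> date:
    """Shift a date by a random offset: 'artificial data aging'."""
    return d + timedelta(days=random.randint(-max_days, max_days))

print(mask_card("4111111111111111"))
print(age_date(date(2008, 2, 7)))
```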

Optim, a product originally from Princeton Softech but now part of IBM, provides just such a solution for large enterprises struggling with the difficulties of testing against production data. Using Optim, DBAs and developers get more than a secure test data generator: they get tools that look into the structure of the development database and ensure that all referential integrity rules are maintained even as the data is obfuscated. This would be horrendous to undertake manually, but a solution such as Optim caters for it automatically.

As well as maintaining the look of the data, the Optim solution ensures that it still passes elementary tests, such as a year of birth matching a person's age, and postcode, phone area code and address all reconciling, as they would in real life.
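The sort of cross-field check implied here might look like the sketch below, which uses an assumed row layout and a toy postcode-to-area-code lookup; it illustrates the idea rather than Optim's actual mechanism:

```python
from datetime import date

# Toy lookup: postcode district -> expected phone area code.
AREA_CODES = {"SW1": "020", "M1": "0161"}

def is_consistent(row: dict, today: date) -> bool:
    """Check that age and phone area code reconcile with other fields."""
    born = row["date_of_birth"]
    expected_age = today.year - born.year - (
        (today.month, today.day) < (born.month, born.day)
    )
    age_ok = row["age"] == expected_age
    area = AREA_CODES.get(row["postcode"].split()[0])
    phone_ok = area is not None and row["phone"].startswith(area)
    return age_ok and phone_ok

row = {
    "date_of_birth": date(1970, 5, 1),
    "age": 37,
    "postcode": "SW1 1AA",
    "phone": "020 7946 0000",
}
print(is_consistent(row, date(2008, 2, 7)))  # True
```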

The demands on security officers to ensure adequate governance, regulation and compliance have focussed a great deal of energy on production systems, but I would suggest that few have considered the issue of development test data. In fact the Payment Card Industry (PCI) rules insist that credit card data be masked in software testing environments, so if you are subject to these rules and are not implementing data de-identification you are immediately open to action.

With IBM's acquisition of Princeton Softech now complete, this new business unit has an opportunity to take Optim forward under the watchful eye of the IBM engagement engine. Opportunities will undoubtedly present themselves for the Optim team to work alongside the well-respected Rational business unit and create more demand in this somewhat overlooked but vital area of IT security. I'll watch its progress with interest.
