IBM Optim and Data De-Identification

Content Copyright © 2008 Bloor. All Rights Reserved.

For many, IT security has been a long journey that is
starting to come to an end as perimeters (at least those that exist) are
secured, data is encrypted, viruses are killed and leaks are plugged using the
latest vendor offerings.

Unfortunately the journey is far from over, as those who
undertake systems implementation and development are finding out.

The issue? Data de-identification.

For those unfamiliar with the term, some explanation is necessary.

Imagine that you run the software development function for a
bank. You employ teams of developers who cut code all day creating bespoke
applications for the various end user departments. Maybe trading solutions,
maybe back office process solutions.

How are you going to test the software?

Easy: simply take a cut of data from the production
database, populate the development server and start running the tests. This is
a practice that would be familiar to development shops all over the world and
happens all the time. The inherent problem it poses, from a security viewpoint,
is that the cut of production data now sitting on the development server
is a full and frank copy of what is probably very sensitive information. The
fact that the data has been extracted from the production server and now sits in
software development almost inevitably means that it resides below the
radar of the corporate security team, and therefore represents a potentially
huge data leak waiting to be exploited.

For some this practice may seem a bit far-fetched, but I
would suggest that for the majority of development shops, whether internal
departments or external consultancies, this is exactly what happens day in, day
out. By the nature of software development, security is often seen as a long way
down the list of concerns. As long as the daily build does just that, and the
latest code is put into a fireproof safe overnight, few give a thought to the
nature of the test data.

Some smarter developers see a way around this problem and
decide to create their own test data, using clever algorithms that churn
out random data sets for customer names or credit card details. In fact quite a
few database administrators have become adept at writing SQL code to generate
test data, perhaps using a vendor's sample database as seed values. The problem
with this approach is that it is very one-dimensional. How can you be certain
that you have quality of data as well as volume? Creating volumes of data is
easy; creating meaningful data that actually looks and behaves as your
production data does is what is tough.
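
By way of illustration, here is a minimal sketch, in Python rather than SQL, of what the DIY approach typically amounts to. The seed lists and column names are invented for the example; the point is simply that volume is cheap while realism is not.

    import random

    # Invented seed values; in practice a DBA might lift these from a
    # vendor's sample database.
    FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave"]
    LAST_NAMES = ["Smith", "Jones", "Patel", "Chen"]

    def random_customer(customer_id):
        # Volume is trivial to produce, but note what is missing: no
        # realistic spread of names, no link between address and phone
        # number, no plausible account history. The data is one dimensional.
        return {
            "customer_id": customer_id,
            "first_name": random.choice(FIRST_NAMES),
            "last_name": random.choice(LAST_NAMES),
            # A random 16-digit string will rarely pass even a basic
            # card-number validity check in the application under test.
            "card_number": "".join(random.choice("0123456789") for _ in range(16)),
        }

    test_rows = [random_customer(i) for i in range(100000)]
    print(test_rows[0])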

Realistic data not only creates a system that looks and
works as it should, it also helps engage the end user customer, who gets
to see the type of data they process daily.

Another difficulty with the DIY approach is how to create
meaningful data that stretches across a relational database structure,
preserving referential integrity and ensuring that correct data types sit in
correct columns. In fact, once you start to look at the complexities of the
problem, populating a database of more than a few tables with meaningful,
realistic data that actually works begins to look like a very ugly herd of
elephants storming over the horizon.
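
To make the referential integrity point concrete, the following illustrative sketch (with an invented two-table layout) shows the essence of what any masking or generation scheme has to get right: whatever transformation is applied to a key in the parent table must be applied identically wherever that key appears, or the joins in the test system simply break.

    # Illustrative only: two in-memory "tables" with a parent/child link.
    customers = [
        {"customer_id": 101, "name": "Alice Smith"},
        {"customer_id": 102, "name": "Bob Jones"},
    ]
    orders = [
        {"order_id": 1, "customer_id": 101, "amount": 250.00},
        {"order_id": 2, "customer_id": 102, "amount": 99.95},
    ]

    # Build one consistent mapping for the key being masked...
    id_mapping = {c["customer_id"]: 9000 + i for i, c in enumerate(customers)}

    # ...and apply it to every table that references that key, so foreign
    # keys still resolve after masking.
    for c in customers:
        c["customer_id"] = id_mapping[c["customer_id"]]
    for o in orders:
        o["customer_id"] = id_mapping[o["customer_id"]]

    # Every order still points at a valid customer.
    valid_ids = {c["customer_id"] for c in customers}
    assert all(o["customer_id"] in valid_ids for o in orders)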

Aside from internally developed applications we also need to
consider the implementation of solutions such as JDE, SAP and Siebel. These are
complicated implementations by anyone's reckoning and demand to be taken very
seriously when it comes to testing and deployment. For many, the only way to
undertake proper testing is to use production data and hope that it
remains secure inside the development team.

Security of the development team also needs consideration. The fashion for
offshoring may or may not be shrinking, but the reality is that many corporate
applications are developed using overseas resources based in countries that
many of the development managers commissioning the work have never even
visited. How are these people supposed to test their code? Fine, the code could
be sent back for testing, but what if you don't have the resources available?
For many companies the result is data being sent overseas and ending up
completely out of the control of the original security team.

This is where data
de-identification comes into play.

This is the process of masking original data by scrambling
the source information so that it becomes useless as a data set, yet still
retains the look, feel and consistency of the original. The resulting
obfuscation may take the form of randomly swapping first and last names,
randomly substituting certain credit card details, or artificially aging the
data. This is still extremely useful data for system testers and implementers,
but of no value to anyone tempted to run off with it as part of a data theft
scheme.
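
The sketch below gives a flavour of these transformations in the simplest possible terms. It is a generic illustration of the techniques described, not a description of how Optim itself implements them, and the sample rows are entirely made up.

    import random
    from datetime import date, timedelta

    # Entirely fictitious sample rows.
    rows = [
        {"first": "Alice", "last": "Smith", "card": "4929123456781234",
         "opened": date(2006, 3, 14)},
        {"first": "Bob", "last": "Jones", "card": "4929987654329876",
         "opened": date(2007, 7, 2)},
    ]

    # 1. Swap surnames between rows so no real name pairing survives.
    surnames = [r["last"] for r in rows]
    random.shuffle(surnames)
    for r, s in zip(rows, surnames):
        r["last"] = s

    # 2. Substitute the middle digits of each card number, keeping the
    #    length and prefix so the value still looks like a card number.
    for r in rows:
        middle = "".join(random.choice("0123456789") for _ in range(6))
        r["card"] = r["card"][:6] + middle + r["card"][12:]

    # 3. Artificially age the data by shifting dates back a fixed period.
    for r in rows:
        r["opened"] -= timedelta(days=365)

    print(rows)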

Optim, a product originally from Princeton Softech but now
part of IBM, provides just such a solution to large enterprises struggling with
the difficulties of testing solutions with production data. Using Optim, DBAs
and developers get more than a secure test data generator; they get tools that
look into the structure of the development database and ensure that
all referential integrity rules are maintained even though the data is being
obfuscated. This would be horrendous to undertake manually, but with a solution
such as Optim the problem is catered for automatically.

As well as maintaining the look of the data, the Optim solution
ensures that it still passes elementary tests, such as the year of birth
matching a person's age, and the postcode, phone area code and address all
reconciling, as in real life.
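
The following sketch shows the sort of elementary cross-field check involved. The postcode-to-area-code lookup is invented for the example and is in no way Optim's actual rule set.

    from datetime import date

    # Invented lookup of which phone area codes belong with which
    # postcode districts.
    POSTCODE_TO_AREA_CODE = {"SW1": "020", "M1": "0161", "EH1": "0131"}

    def is_consistent(record, today=date(2008, 1, 1)):
        # Year of birth must agree with the stated age (one year of slack,
        # since we do not know the exact birthday).
        if abs((today.year - record["year_of_birth"]) - record["age"]) > 1:
            return False
        # Phone area code must reconcile with the postcode district.
        district = record["postcode"].split()[0]
        return POSTCODE_TO_AREA_CODE.get(district) == record["phone"].split()[0]

    print(is_consistent({"year_of_birth": 1970, "age": 38,
                         "postcode": "SW1 2AA", "phone": "020 7946 0000"}))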

Demands on security officers to ensure adequate governance,
regulation and compliance have focussed a lot of energy on production systems,
but I would suggest that few have considered the issue of development test
data. In fact the Payment Card Industry (PCI) regulations insist that credit
card data be masked in the software testing environment, so if you are subject
to these rules and not implementing data de-identification you are immediately
open to action.

With the IBM acquisition of Princeton Softech now finalised,
this new business unit has an opportunity to take Optim forward under
the watchful eye of the IBM engagement engine. Opportunities will now
undoubtedly present themselves for the Optim team to work alongside the
well-respected Rational business unit and create more demand in this somewhat
overlooked but vital area of IT security. I'll watch its progress with interest.