Accurate test data

Accurate and complete test data (masked where appropriate for compliance with data protection laws) is vital for the development and testing of new applications. Despite this, the major purveyors of testing tools have shown little innovation in this area over the last few years: the common approach has been, and remains, to sample existing data sources and use the results (suitably masked where necessary) for testing.

There is a logical flaw in this methodology. If you test with a subset of the data, what you get to test is all the normal operations that you have to run, which is fine as far as it goes. What you won’t get to test, or not fully, are the outliers: those exceptional and quirky data elements that occur only rarely but are often the cause of faults, precisely because sampling does not retrieve them all and so they never get tested. If you think about it, around 80% of production data usually consists of the same or similar transactions. Imagine a bell curve: the bulk of production data sits in the middle of the curve, while most production anomalies occur at the edges. It therefore makes sense to test new applications with as complete a dataset as possible.
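The sampling argument can be made concrete with a little arithmetic. The sketch below assumes a simple random sample drawn without replacement, which is only one of several ways a subset might be taken, and the function name and row counts are illustrative rather than taken from any tool. It computes the chance that a sample contains none of the rare rows.

```python
from math import comb


def prob_outliers_missed(total_rows, outlier_rows, sample_size):
    """Probability that a simple random sample (without replacement)
    contains none of the outlier rows.

    This is the hypergeometric "zero hits" case:
    C(total - outliers, sample) / C(total, sample).
    """
    return comb(total_rows - outlier_rows, sample_size) / comb(total_rows, sample_size)


# Hypothetical figures: a 10,000-row table, a 5% sample, one quirky row.
p_single = prob_outliers_missed(total_rows=10_000, outlier_rows=1, sample_size=500)
print(f"Chance a 5% sample misses a single outlier row: {p_single:.2f}")

# The same table with three rows sharing the same quirk.
p_three = prob_outliers_missed(total_rows=10_000, outlier_rows=3, sample_size=500)
print(f"Chance a 5% sample misses all three outlier rows: {p_three:.2f}")
```

Under these assumptions a single exceptional row is left out of a 5% sample about 95% of the time, so the case it represents would usually never be exercised.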

However, if you are taking data from a production database then this is not a trivial task. Indeed, I am not so sure that it’s a good idea even to sample a production database. In particular, how would you do that against a mission-critical 24x7 environment without significantly impairing performance or investing in replication software? Anyway, that’s another issue; my discussion point here is the test data itself.

Fortunately, not all vendors have been resting on the laurels of ‘sample and mask’. One such is Grid Tools, whose Datamaker product will generate a complete dataset without the use of a source database, allowing testers to create more combinations of data (permutations of rows and columns) in order to exercise the code that has been written. Moreover, something that is often missed with a sample and mask approach is that you actually want invalid data to test against, to make sure that it too is handled correctly. Datamaker will generate such data as well.
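To illustrate the general idea of generating combinations rather than sampling them, here is a minimal sketch in Python. It is not Datamaker's interface or method: the column names, the value lists and the `generate_rows` helper are all hypothetical, chosen only to show how valid, boundary and deliberately invalid values can be combined without touching a source database.

```python
from itertools import product

# Hypothetical column domains. Each list mixes typical values, boundary
# values and deliberately invalid ones (None, a negative amount, an
# unknown or empty currency code).
COLUMN_VALUES = {
    "account_type": ["retail", "business", None],
    "amount": [0, 1, 999_999_999, -1],
    "currency": ["GBP", "USD", "XXX", ""],
}


def generate_rows(column_values):
    """Yield one row (a dict) for every combination of column values."""
    names = list(column_values)
    for combo in product(*(column_values[name] for name in names)):
        yield dict(zip(names, combo))


rows = list(generate_rows(COLUMN_VALUES))
print(f"Generated {len(rows)} rows")
print(rows[0])
```

Every combination appears exactly once, including the awkward ones that a sample of production data would be unlikely to contain. For wide tables the full cross product grows quickly, so in practice some form of pruning would be needed.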

Test data is stored in the Datamaker test repository along with test cases, which may be shared and reused by any number of testers (with each tester using a different version of the test case if appropriate) through either a graphical Windows interface or a web interface. In other words, you can have a central team or person create test cases within the repository, and then any number of developers and testers can access these cases on demand.

Note that you can use Datamaker in conjunction with third-party testing tools from vendors such as Compuware, so using Datamaker won’t do anything but help your testing efforts. Oh, and if you’re not convinced by the above arguments and you really love subset and mask: well, Datamaker supports that option too.

(This article was co-authored by Daniel Howard)