DATPROF Subset creates subsets of your existing production data for testing purposes. Each subset is generated from a single driver table as its starting point, with other tables included based on their relationship to that table. These relationships can be derived from existing database relationships or specified manually, and the process is assisted by intelligent suggestions for which tables should be included in full rather than in part.
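
The driver-table idea can be illustrated with a small sketch. This is not DATPROF's implementation: the toy schema, the `build_subset` function and the `FOREIGN_KEYS` list are all hypothetical, and the traversal only walks from parent tables down to child tables, filtering each child through the first parent that reaches it.

```python
from collections import deque

# Hypothetical toy schema: each table is a list of row dicts.
TABLES = {
    "customers": [
        {"id": 1, "region": "EU"},
        {"id": 2, "region": "US"},
        {"id": 3, "region": "EU"},
    ],
    "orders": [
        {"id": 10, "customer_id": 1},
        {"id": 11, "customer_id": 2},
        {"id": 12, "customer_id": 3},
    ],
    "order_lines": [
        {"id": 100, "order_id": 10},
        {"id": 101, "order_id": 11},
        {"id": 102, "order_id": 12},
        {"id": 103, "order_id": 10},
    ],
}

# Hypothetical relationships: (child_table, child_column, parent_table, parent_column).
FOREIGN_KEYS = [
    ("orders", "customer_id", "customers", "id"),
    ("order_lines", "order_id", "orders", "id"),
]


def build_subset(tables, foreign_keys, driver, keep):
    """Filter the driver table, then keep only child rows that reference kept parents."""
    subset = {driver: [row for row in tables[driver] if keep(row)]}
    queue = deque([driver])
    while queue:
        parent = queue.popleft()
        for child, child_col, fk_parent, parent_col in foreign_keys:
            # Skip relationships of other parents and children already filtered.
            if fk_parent != parent or child in subset:
                continue
            parent_keys = {row[parent_col] for row in subset[parent]}
            subset[child] = [
                row for row in tables[child] if row[child_col] in parent_keys
            ]
            queue.append(child)
    return subset


# Use "customers" as the driver table and keep only EU customers.
subset = build_subset(TABLES, FOREIGN_KEYS, "customers", lambda r: r["region"] == "EU")
```

Because every kept child row points at a kept parent row, the foreign keys in the resulting subset stay valid. A table that should be included in full (a small lookup table, for instance) would simply bypass the filter.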

Fig 01 - Process models in DATPROF Subset
The results can be visualised as either a data model or a process model (both shown in Figure 1). These are helpful for understanding your database’s structure, and thus how best to create your subset. Various validation techniques are provided to support this process. You can either completely refresh your test database or append new test data cases to its existing content; duplicate data is handled appropriately, and all constraints remain valid.
DATPROF Privacy is a rule-based data masking solution with native support for Oracle, SQL Server, PostgreSQL, MySQL, IBM DB2 and MariaDB. In theory, it can support any other data source by using one of the aforementioned databases as a processing engine, and it can mask data stored in a variety of formats, including CSV and XML. Notably, it masks live data in situ, meaning you never need to move or extract it for the purposes of masking. Masking rules can be customised or used out of the box, and can be applied in a specific order by setting dependencies. The product masks consistently across all of your systems and applications, and it delivers meaningful audit reports on your data masking and subsetting actions.
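
Two of the ideas above, ordering rules by their dependencies and masking consistently across systems, can be sketched as follows. This is an illustration only, not how DATPROF Privacy works internally: the rule names, the `SECRET` key, the replacement list and the sample rows are all hypothetical. Consistency is approximated here with a keyed hash, so the same input always maps to the same replacement.

```python
import hashlib
import hmac
from graphlib import TopologicalSorter  # standard library, Python 3.9+

SECRET = b"demo-key"  # hypothetical masking key
FIRST_NAMES = ["Alex", "Sam", "Robin", "Kim"]  # hypothetical replacement values


def pick(value, choices):
    """Deterministically map a value to one of the choices using a keyed hash."""
    digest = hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).digest()
    return choices[int.from_bytes(digest[:4], "big") % len(choices)]


def mask_first_name(row):
    row["first_name"] = pick(row["first_name"], FIRST_NAMES)


def mask_email(row):
    # Derived from the already masked first name, so it must run second.
    row["email"] = f"{row['first_name'].lower()}@example.com"


RULES = {"mask_first_name": mask_first_name, "mask_email": mask_email}

# Each rule maps to the set of rules that must run before it.
DEPENDS_ON = {"mask_first_name": set(), "mask_email": {"mask_first_name"}}


def rule_order():
    """Return the rule names in an order that respects the dependencies."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


def apply_rules(rows):
    """Mask every row in place, running the rules in dependency order."""
    order = rule_order()
    for row in rows:
        for name in order:
            RULES[name](row)
    return rows


# The same person appears in two hypothetical systems.
crm = apply_rules([{"first_name": "Maria", "email": "maria@corp.test"}])
billing = apply_rules([{"first_name": "Maria", "email": "m.jansen@corp.test"}])
```

Since the replacement depends only on the original value and the key, `crm` and `billing` end up with the same masked first name, which is what keeps joins and cross-system test cases working after masking.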
DATPROF Privacy also provides the company’s synthetic data generation capability, compatible with all of the data sources listed above. The product provides a selection of replacement data candidates and algorithms out of the box, including logical generators, weighted lists, regular expressions, generators that leverage seed data, and more. You can also build your own, using custom database functions, multi-column seed files (for example, a correlated seed list) and “generator expressions” that allow you to combine other types of generator into a bespoke formula, among other things.
Synthetic data is generated directly in the database, in a uniform fashion for all major databases, and either during or after masking depending on whether you want to add data to your subset or replace data that’s already there. Data is created against “generation sets” of tables, with each column in the table assigned one of the generators described above. Various configuration options are available on each column, including the percentage of null values to generate. Generated values can be combined to create a fully synthetic data set (for example, concatenating first and last names to get full names), columns can be earmarked to generate simultaneously in order to preserve correlations, and foreign key relationships can be discovered and included in your generated data automatically.
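
To make the generator concepts more concrete, here is a minimal sketch of a "generation set" in which every column is assigned a generator. It is an assumption-laden illustration rather than DATPROF's actual mechanism (which runs inside the database): the helper names `weighted_list`, `expression` and `with_nulls`, the column names and the sample values are all hypothetical.

```python
import random


def weighted_list(values, weights):
    """Generator that picks one value according to the given weights."""
    def generate(rng, row):
        return rng.choices(values, weights=weights, k=1)[0]
    return generate


def expression(func):
    """Generator that derives a value from columns generated earlier in the row."""
    def generate(rng, row):
        return func(row)
    return generate


def with_nulls(generator, null_fraction):
    """Wrap a generator so that roughly null_fraction of its values are None."""
    def generate(rng, row):
        if rng.random() < null_fraction:
            return None
        return generator(rng, row)
    return generate


# Hypothetical generation set: column name -> generator.
# Columns are filled in insertion order, so full_name can use the two name columns.
GENERATION_SET = {
    "first_name": weighted_list(["Anna", "Ben", "Chloe"], [5, 3, 2]),
    "last_name": weighted_list(["Smith", "Jones"], [1, 1]),
    "full_name": expression(lambda row: f"{row['first_name']} {row['last_name']}"),
    "middle_initial": with_nulls(weighted_list(list("ABCDE"), [1] * 5), 0.4),
}


def generate_rows(generation_set, count, seed=0):
    """Generate count rows; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    rows = []
    for _ in range(count):
        row = {}
        for column, generator in generation_set.items():
            row[column] = generator(rng, row)
        rows.append(row)
    return rows


rows = generate_rows(GENERATION_SET, 1000, seed=42)
```

Deriving `full_name` from the two name columns mirrors the concatenation example in the text, and generating the columns of a row together is one simple way to keep them correlated.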

Fig 02 - Test data monitoring in DATPROF Runtime
Finally, DATPROF Runtime (see Figure 2) allows you to centrally configure, manage and monitor your test data user groups, their databases, and the masking and subsetting applications available to them. A REST API is provided to support this.