When masking isn’t enough

Along with various other things (data migration, profiling, cleansing, governance and archiving) I am currently researching data masking with a view to producing relevant reports (NB: if you are a vendor in one of these spaces and I haven’t been in touch, please contact me). One of the companies I contacted with respect to data masking was a Canadian company called Privacy Analytics and, although it does masking, it has declined to participate in the research. The reason is that what it offers is so much more than masking that it doesn’t want to be tarred with a lower-capability brush.

The company specialises in the healthcare market, although its facilities would be equally suitable in financial services and other markets. The key point about Privacy Analytics is that last word: the software allows you to do analytics against the data even though the data has been de-identified. If you are a researcher in the healthcare market you want to know about things like where the patients live, what gender they are, how old they are, how long they were in hospital, which hospital they were in, what drugs they were prescribed and so on. If you just mask this data in a conventional manner then either you lose the ability to do this sort of research or you run the risk of the data being re-identified. More broadly, this doesn’t just apply to formal research but to any type of meaningful analytics: for example, insurance companies involved in the healthcare market might want to analyse claims by age, diagnosis and so on. And more broadly still, insurance companies might want to analyse all sorts of other claims information that has nothing to do with healthcare but which contains sensitive data. Finally, of course, there are lots of other such analyses in other industries.

What Privacy Analytics does is to split data attributes into two categories: direct attributes (things like names, phone numbers and so forth), which it masks using conventional techniques; and indirect attributes (the sort of data you want to research against), for which it uses statistical techniques. This matches the HIPAA requirements that allow safe harbour (masking) and/or statistical approaches (that is, you might use both or just masking – you wouldn’t use just statistical methods).
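To make the two-category split concrete, here is a minimal sketch in Python. It is not Privacy Analytics' implementation: the column names, the `DIRECT` and `INDIRECT` sets, the salt and the choice of a salted hash as the "conventional" masking technique are all assumptions made purely for illustration.

```python
import hashlib

# Hypothetical classification of columns; in practice this is decided per dataset.
DIRECT = {"name", "phone"}                          # direct identifiers: masked
INDIRECT = {"birth_date", "post_code", "gender"}    # quasi-identifiers: kept for analysis


def mask_direct(value: str, salt: str = "demo-salt") -> str:
    """Replace a direct identifier with a salted hash token.

    A salted hash is just one conventional masking technique, chosen here
    as an assumption; substitution or shuffling would serve equally well.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "ID-" + digest[:10]


def split_and_mask(record: dict) -> tuple[dict, list[str]]:
    """Mask the direct attributes and report which fields are quasi-identifiers.

    The quasi-identifiers are passed through unchanged at this stage; they
    are the fields that would go on to statistical treatment.
    """
    masked = {
        field: mask_direct(value) if field in DIRECT else value
        for field, value in record.items()
    }
    quasi = sorted(field for field in record if field in INDIRECT)
    return masked, quasi


patient = {
    "name": "Jane Doe",
    "phone": "555-0100",
    "birth_date": "1980-03-03",
    "post_code": "K1A0B1",
    "gender": "F",
}
masked, quasi = split_and_mask(patient)
```

After this step the name and phone number are unreadable tokens, while the birth date, post code and gender are still intact and are flagged as the fields that need the statistical approach.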

For these indirect attributes, also known as quasi-identifiers, Privacy Analytics provides a risk assessment capability. Here you can set a risk threshold against the threat that a nosy neighbour (or journalist) will want to identify some particular person, against the possibility that someone will want to re-identify a group of patients en masse, or against both. The software will then set the granularity for each of the quasi-identifiers, as appropriate. For example, if one of your parameters for research is birth date then you might group by birth week or month or year; if a parameter is location then you might remove two, three or four characters from the end of the post code.
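The generalisation step can be sketched in the same way. The following Python is an illustration only, under several assumptions of mine: risk is measured as one divided by the size of the smallest group of records sharing the same generalised values (a common k-anonymity-style measure, loosely matching the single-person threat), and the search simply tries birth week, month and year against two, three and four dropped post code characters, from finest to coarsest. The function names and sample records are hypothetical, and none of this is claimed to be the vendor's actual method.

```python
from collections import Counter
from datetime import date


def generalise_birth(d: date, level: str) -> str:
    """Coarsen a birth date to ISO week, month or year."""
    if level == "week":
        iso = d.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"
    if level == "month":
        return f"{d.year}-{d.month:02d}"
    if level == "year":
        return str(d.year)
    raise ValueError(f"unknown level: {level}")


def generalise_post_code(code: str, drop: int) -> str:
    """Remove `drop` characters from the end of a post code."""
    return code[: max(0, len(code) - drop)]


def max_risk(records: list[dict], level: str, drop: int) -> float:
    """Assumed risk measure: 1 / size of the smallest group of matching records."""
    groups = Counter(
        (generalise_birth(r["birth_date"], level),
         generalise_post_code(r["post_code"], drop))
        for r in records
    )
    return 1 / min(groups.values())


def choose_granularity(records: list[dict], threshold: float):
    """Return the first (level, drop) pair, finest first, that meets the threshold."""
    for level in ("week", "month", "year"):
        for drop in (2, 3, 4):
            if max_risk(records, level, drop) <= threshold:
                return level, drop
    return None  # no tried setting is safe enough


records = [
    {"birth_date": date(1980, 3, 3), "post_code": "K1A0B1"},
    {"birth_date": date(1980, 3, 20), "post_code": "K1A0B2"},
    {"birth_date": date(1980, 7, 1), "post_code": "K1A1C3"},
    {"birth_date": date(1980, 11, 9), "post_code": "K1A2D4"},
]
print(choose_granularity(records, threshold=0.5))  # ('year', 3)
```

With these four made-up records, grouping by week or month always leaves somebody alone in a group, so the sketch has to fall back to birth year and a three-character truncation before every record shares its values with at least one other. A tighter threshold, such as 0.1, cannot be met at any of the settings tried, which is the point at which a real tool would need coarser options or more data.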

As far as I know, no other vendor in the data masking space has anything comparable to this: I know of no other supplier that combines this sort of risk assessment capability with the generalisation of quasi-identifiers to support research and analytic functions. Of course, many data masking vendors are focused on test data rather than supporting analytics but that only goes to demonstrate that most data masking tools are inadequate: you’d really like a product that does both. Privacy Analytics can do both. However, since it stands alone in supporting analytics you can hardly blame the company for focusing on this area rather than test data, where there is much more competition.