The vast majority of organisations handle sensitive data on a regular basis. But many – perhaps even most – do not know where all of it resides, or even which of their data is sensitive. Personal customer information may spring to mind as an immediate example of sensitive data, but even then, it is not unusual for it to be haphazardly strewn across multiple data stores and systems without any kind of centralised knowledge or authority. Other kinds of sensitive data include employee information, company financial information, intellectual property, marketing plans, new product launch details, and so on, and the same applies to them.
Moreover, it is not always obvious when data is sensitive: data that does not appear to identify an individual may in fact do so when combined with other pieces of data. For example, researchers at the University of Texas took de-identified data released by Netflix and compared it with movie reviews on a third-party website, achieving a 68% re-identification success rate. Even worse, some kinds of sensitive data, geospatial data for example, carry a serious risk of re-identification even after anonymisation, because of the inherent nature of the data itself.
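To make the idea of re-identification by combination concrete, the following is a minimal sketch of a linkage attack on toy data. It is not the method the researchers used; the records, the names, the `link` function and the `min_matches` threshold are all hypothetical and chosen purely for illustration. It simply shows how an "anonymised" token can be tied back to a named person once enough of their records coincide with a public source.

```python
from collections import defaultdict

# Hypothetical "anonymised" release: real user IDs replaced by opaque tokens.
anonymised_ratings = [
    ("u1", "Movie A", "2005-03-01", 5),
    ("u1", "Movie B", "2005-03-04", 2),
    ("u2", "Movie A", "2005-03-01", 3),
    ("u2", "Movie C", "2005-04-10", 4),
]

# Hypothetical public reviews posted under real names on another site.
public_reviews = [
    ("Alice", "Movie A", "2005-03-01", 5),
    ("Alice", "Movie B", "2005-03-04", 2),
    ("Bob", "Movie C", "2005-04-10", 4),
]


def link(anonymised, public, min_matches=2):
    """Return {name: token} for names that match exactly one token.

    A token is a candidate for a name when the two share at least
    `min_matches` identical (title, date, rating) records.
    """
    by_token = defaultdict(set)
    for token, title, date, rating in anonymised:
        by_token[token].add((title, date, rating))

    by_name = defaultdict(set)
    for name, title, date, rating in public:
        by_name[name].add((title, date, rating))

    matches = {}
    for name, reviews in by_name.items():
        candidates = [
            token
            for token, records in by_token.items()
            if len(records & reviews) >= min_matches
        ]
        # Only claim a re-identification when the match is unambiguous.
        if len(candidates) == 1:
            matches[name] = candidates[0]
    return matches


reidentified = link(anonymised_ratings, public_reviews)
print(reidentified)
```

In this toy data set neither the titles, the dates nor the ratings identify anyone on their own, yet two coinciding records are enough to tie a token to a name. Lowering `min_matches` links more people at the cost of more false matches in realistic data.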
This is a serious problem, for a number of reasons. The most significant is that you can’t take steps to protect your sensitive data if you don’t know what data is sensitive in the first place. In addition, significant swathes of the world enforce government-backed compliance regulations (GDPR, for instance) that require you to know (and prove that you know) what sensitive data you are storing, to protect it, and to produce, amend or delete significant parts of it at effectively any time. In this sense, sensitive data discovery is a foundational step for data governance in general and regulatory compliance in particular.