Discovering and Identifying Sensitive Data

The exposure of sensitive data in documents can create serious problems. Some types of information, such as Tax File Numbers (or other national identification numbers) and credit card numbers, are inherently sensitive. They always need to be kept out of public documents, and any internal documents that hold them need protection. Regulations and contractual obligations create a need to keep other types of data safe. Newer regulations like the EU GDPR require businesses to know what data they hold about individuals in the EU and to be able to produce and delete that information on demand. It is impossible to meet these requirements without knowing where the data exists.

This kind of information can slip into documents that the public can reach or ones that an intruder can access with very little effort. Training helps to cut down on this, but people will make mistakes. A document’s purpose may change. One that was originally intended as private may get sent to a business partner and require redaction.

Some regulatory systems impose huge costs as a result of breaches caused by non-compliance or negligence. Disclosing various kinds of personal information can cost millions under the Notifiable Data Breaches scheme in Australia, HIPAA in the US, and PCI DSS anywhere in the world.

While databases are relatively easy to control, free-form documents such as email, memos, and presentations are a bigger problem. Even in databases, some fields allow notes of any kind and might hold uncaught sensitive information.

A program to catch sensitive data is needed in order to avoid costly disclosures and wasted resources. Approaches can range from the purely manual to the highly automated, though some human intervention is always necessary. Each approach has its advantages and disadvantages.

Manual inspection

The simplest approach from a technical standpoint is to have someone review all documents for sensitive information. Done well, it produces accurate results with few false positives, but doing it well is difficult. People are generally bad at repetitive, mundane tasks, and someone assigned to review a lot of documents can get careless and miss data that should be flagged.

Humans can recognize different contexts. An individual’s name by itself normally isn’t sensitive information. In business records, it could be moderately sensitive or highly confidential. With personal health information under HIPAA, just disclosing a patient’s name can be a significant violation.

The big problem in manual review is keeping up with all the files. Many kinds of documents could pose a risk of disclosure. With a review process that depends on human speeds, it’s tempting to skip over or forget some categories.

Privacy can be an issue. Someone must get access to all those documents, raising the question of who should be authorized to do it. Authorizing one employee to read so many documents may not be a wise idea.


Scripted pattern matching

A person who knows a scripting language such as Python or PHP can create scripts to find patterns that suggest sensitive data. US Social Security numbers follow the pattern xxx-xx-xxxx. Government-issued ID numbers in each country have their own patterns. Credit card numbers generally have sixteen digits in groups of four. Each card issuer owns a specific range of numbers, so it’s possible to narrow down the search. Other data types, including a company’s own identification numbers, will have their own patterns that can be matched.

Regular expressions, which every serious scripting language supports, aid in writing code to match patterns. A flexibly written pattern can deal with dashes and multiple spaces as separators and other variants.
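As a rough illustration of the idea, the sketch below matches the two formats mentioned above while tolerating dashes and runs of spaces as separators. The pattern names and the find_candidates helper are made up for this example, and the patterns are deliberately simple: they flag candidates for review and do not prove that a match is a real number.

```python
import re

# Illustrative patterns only; real formats have more variants than these.
PATTERNS = {
    # xxx-xx-xxxx, with a dash or one or more spaces between the groups
    "ssn": re.compile(r"\b\d{3}(?:-|\s+)\d{2}(?:-|\s+)\d{4}\b"),
    # sixteen digits in groups of four, separated by dashes, spaces, or nothing
    "card": re.compile(r"\b(?:\d{4}[-\s]*){3}\d{4}\b"),
}


def find_candidates(text):
    """Return (label, matched_text) pairs for every pattern hit in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


if __name__ == "__main__":
    sample = "SSN 123-45-6789 and card 4111 1111  1111-1111."
    for label, value in find_candidates(sample):
        print(label, value)
```

Requiring a separator in the first pattern is a design choice: without it, any nine-digit number would be flagged.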

Scripts don’t take a lot of effort, but they’re limited in the kinds of data they can identify. Data types that don’t have a predictable and regular form are hard to pin down with regular expressions.

A script can look for keywords, such as “password,” in addition to numeric patterns. This will produce many false positives, but if the only consequence is a request to review the document, that may be acceptable.
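A keyword scan of this kind might look like the following sketch. The keyword list and the flag_for_review function are assumptions made for illustration; a real list would be tuned to the organisation.

```python
import re

# Hypothetical keyword list; adjust to the organisation's own vocabulary.
KEYWORDS = ("password", "passwd", "secret")

KEYWORD_PATTERN = re.compile(
    "|".join(re.escape(word) for word in KEYWORDS), re.IGNORECASE
)


def flag_for_review(documents):
    """Given a dict of name -> text, return the names that contain a keyword."""
    return sorted(
        name for name, text in documents.items() if KEYWORD_PATTERN.search(text)
    )


if __name__ == "__main__":
    docs = {
        "notes.txt": "The Password is hunter2",
        "report.txt": "Quarterly results",
        "help.txt": "Enter your password",
    }
    print(flag_for_review(docs))
```

In this sample, help.txt is flagged even though it holds no secret, which is the kind of false positive a reviewer would then dismiss.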

Scripts are often limited in the file types they can access. It’s possible to add more types, but each type means extra work. They’re a quick and dirty solution, which can be a reasonable approach on a small scale.

Context-based identification

Better results come from software which is designed to discover sensitive information in a context-based way. This covers a range of approaches. At a minimum, it correlates multiple cues to give an estimate of likelihood. A long string of digits in proximity to the phrase “account number” is likely to be an account number.
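A minimal sketch of correlating two cues could look like this. The 40-character window, the cue phrases, and the scores of 0.9 and 0.3 are arbitrary values chosen for the example, not figures from any real product.

```python
import re

DIGIT_RUN = re.compile(r"\b\d{8,}\b")  # assumed: 8+ digits counts as "long"
CUE = re.compile(r"account\s+(?:number|no\b)", re.IGNORECASE)


def score_digit_strings(text, window=40):
    """Return (digits, score) pairs; the score is higher when a cue is nearby."""
    results = []
    for match in DIGIT_RUN.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        context = text[start:end]
        # Illustrative scores: 0.9 with a nearby cue, 0.3 without one.
        score = 0.9 if CUE.search(context) else 0.3
        results.append((match.group(), score))
    return results


if __name__ == "__main__":
    sample = "Please credit account number 1234567890 today."
    print(score_digit_strings(sample))
```

Production software would combine many more cues, but the principle is the same: the digit string alone is weak evidence, and the surrounding words raise or lower the estimate.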

Strings like “username” and “password” may indicate confidential data, but they often don’t. Software can exclude uses whose context makes them probably safe, such as “Enter your username and password.” Some patterns, such as all zeroes, are almost certainly dummy strings. The software can exclude them from flagging as sensitive, reducing the number of false positives.
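The two exclusions described above can be sketched as follows. The list of safe phrases and both function names are hypothetical, and a real tool would use a far richer notion of context than a fixed phrase list.

```python
import re

CREDENTIAL_WORD = re.compile(r"\b(?:username|password)\b", re.IGNORECASE)
# Assumed examples of wording that is probably just a form label or instruction.
SAFE_PHRASE = re.compile(
    r"enter your (?:username|password)|forgot your password", re.IGNORECASE
)


def should_flag_line(line):
    """Flag a line that mentions credentials unless it matches a safe phrase."""
    if not CREDENTIAL_WORD.search(line):
        return False
    return SAFE_PHRASE.search(line) is None


def is_dummy_number(value):
    """Return True when the value's digits are all zeroes (likely a placeholder)."""
    digits = re.sub(r"\D", "", value)
    return bool(digits) and set(digits) == {"0"}


if __name__ == "__main__":
    print(should_flag_line("Enter your username and password."))
    print(should_flag_line("password: hunter2"))
    print(is_dummy_number("0000-0000-0000-0000"))
```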

Software which has gone through development and testing has other advantages over throwaway scripts. It covers more file formats. It generates reports and alerts in a user-friendly form, including analytics that can help to determine if compliance is improving or problems are growing. It’s less likely to crash or output nonsense when it encounters something unusual. In general, it offers more confidence in its results.

Artificial intelligence techniques can recognize the subject matter of a document. This lets the software make a better estimate of how likely the document is to contain confidential data. If the document focuses on one or more individuals, the software can lower its threshold for flagging information as possibly sensitive. If it discusses general topics and doesn’t contain people’s names, the threshold can be higher.

Other uses of data discovery

Once data discovery is in place, it has many uses beyond data security and privacy compliance. By applying a different set of rules, it can help locate relevant information in archives. A simple text search won’t find all the needed information when you aren’t sure exactly what strings to search for. Discovery software can instead search for types of data, such as information on income and expenditures. Locating all documents related to a topic can help to find elusive answers about past events.

Discovery can reduce the chaos when reorganizing an unstructured collection of documents. Consider a pile of letters that has just been scanned and run through OCR (optical character recognition). Putting them all into a single directory makes the collection hard to use. Software that extracts dates and names from the letters makes it possible to add metadata that greatly facilitates their use.
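As a small illustration of the date half of that task, the sketch below pulls dates written in the form "3 March 1998" out of OCR text. It assumes English month names and that one date format only; extracting names is considerably harder and is not attempted here. The extract_dates function is a hypothetical helper.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

DATE_PATTERN = re.compile(
    r"\b(\d{1,2})\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b", re.IGNORECASE
)


def extract_dates(text):
    """Return the valid 'day Month year' dates found in the text, in order."""
    found = []
    for day, month, year in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), MONTHS[month.lower()], int(day)))
        except ValueError:
            continue  # impossible date, e.g. an OCR misread such as 31 February
    return found


if __name__ == "__main__":
    letter = "Dear Sir, further to our letter of 3 March 1998 ..."
    print([d.isoformat() for d in extract_dates(letter)])
```

The extracted dates could then be written to whatever metadata store the collection uses, for example as part of a file name or an index entry.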

Cloud migration and other digital transformation projects become more informed and efficient with upfront data discovery. Maybe your organisation wants to find files that haven’t been accessed in 3 years and move those to the cloud first. Maybe it wants to delete files that haven’t been accessed in over 4 years. Why pay to store information you don’t need?
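A first pass at finding such files can be sketched with the standard library alone. The stale_files function is a hypothetical helper, and it relies on the file system's last-access timestamp, which some systems update lazily or not at all, so the results should be treated as candidates and not as a definitive list.

```python
import os
import time

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # approximation; ignores leap days


def stale_files(root, years, now=None):
    """Return sorted paths under root whose last access is older than `years`."""
    now = time.time() if now is None else now
    cutoff = now - years * SECONDS_PER_YEAR
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                last_access = os.stat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if last_access < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    for path in stale_files(".", years=3):
        print(path)
```

Running it once with years=3 and once with years=4 would give the two lists described above: candidates to migrate first and candidates to delete.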

No single method of discovery will produce 100% accuracy 100% of the time, but some approaches are better than others. Good data discovery software will locate a large proportion of whatever is needed, letting employees confirm or refine a small subset of the overall data. Even if you don’t have clear rules that define types of data, artificial intelligence (machine learning) can identify patterns that humans might miss. Whether it’s to avoid information leaks, chase down elusive facts, or organize and tag files, data discovery software allows better management of unstructured information.