Data anonymization is the process of protecting private or sensitive information in data sets by erasing or encrypting personally identifiable information (PII) that connects an individual to stored data. PII can include names, Social Security numbers, and addresses, among other pieces of data that can identify a person. This information is anonymized in order to protect the people from whom the data originates. Data anonymization is an industry of growing interest as more regulations, such as the General Data Protection Regulation (GDPR) in the European Union, require organizations that collect, process, and store user data to protect and anonymize that data. Further, in some cases these regulations allow companies to collect anonymized data without user consent.
De-identification is the process of preventing an individual's identity from being compromised, often by removing all PII. The most common method of de-identifying data is pseudonymization, which masks personal identifiers by replacing them with temporary identifiers. When applied to metadata or general data about identification, de-identification is also known as data anonymization. While data anonymization prevents further re-identification even by data controllers and under any conditions, de-identification may preserve identifying information that can be re-linked by a trusted party, depending on the given use case or situation.
In some cases, even when data anonymization techniques are used, attackers can retrace the process to restore identifying data, especially if the data was not anonymized before being processed through multiple sources or platforms, some of which are available to the public. De-anonymization, or re-identification, attempts to reattach data to a real-world person, whether to identify individuals or to use their information to compromise them in some way (such as identity theft). De-anonymization becomes easier as an attacker gathers more data or data sets.
Common techniques for re-identification include data matching or data linking, where a data broker gains a few pieces of information from one data set and connects them to other data points to develop a complete picture. Exploiting flawed anonymization techniques, such as deleting direct identifiers while leaving plenty of indirect identifiers, can likewise reveal sensitive information that a data broker can connect to other pieces of data.
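A linkage attack of this kind can be sketched in a few lines of Python. The datasets, field names, and `link` helper below are entirely fabricated for illustration; real attacks join much larger releases on similar quasi-identifiers.

```python
# Sketch of a linkage (re-identification) attack: two releases share
# quasi-identifiers (ZIP code, birth year, sex) that can be joined to
# reattach names to "de-identified" records. All data is fabricated.

# Release A: a public roll containing direct identifiers.
voter_roll = [
    {"name": "A. Example", "zip": "02139", "birth_year": 1960, "sex": "F"},
    {"name": "B. Sample",  "zip": "02139", "birth_year": 1984, "sex": "M"},
]

# Release B: medical records with names deleted but indirect
# identifiers left intact.
medical = [
    {"zip": "02139", "birth_year": 1960, "sex": "F", "diagnosis": "asthma"},
]

def link(records_a, records_b, keys):
    """Join two datasets on shared quasi-identifier columns."""
    matches = []
    for a in records_a:
        for b in records_b:
            if all(a[k] == b[k] for k in keys):
                matches.append({**a, **b})
    return matches

reidentified = link(voter_roll, medical, ["zip", "birth_year", "sex"])
# The single match reattaches a name to a sensitive diagnosis.
```

Because only direct identifiers were removed from the second release, the join on the remaining indirect identifiers is enough to re-identify the record.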
Similarly, a data set pseudonymized or anonymized with a poor or inferior technique, such as a simple character substitution or a common key used to assign pseudonyms, allows an attacker to reverse these efforts if the key is found. All of these techniques have been further bolstered by advanced machine learning and data analysis technology, which can make poorly anonymized data sets all the more vulnerable to de-anonymization.
Perhaps a central question of data anonymization is what data should be anonymized, especially as not all datasets require anonymization. The simple answer is that datasets containing sensitive information or PII need to be anonymized, while those that do not contain such information do not. However, what counts as sensitive data can change according to the individual or sector. For example, information seen as impersonal by a marketing agency may be viewed as highly sensitive by security experts.
This leaves room for interpretation, as what counts as PII in one industry may be seen as benign in another. In such cases, many industries and organizations rely on the definitions of PII found in a given territory's regulations. More generally, the data considered sensitive enough to anonymize includes:
- Phone numbers
- Email addresses
- Social security numbers
- Financial information
- Biometric data
- Medical data
- Racial or ethnic data
- Political opinions
- Sexual orientation
- IP addresses
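A few of the PII types listed above can be flagged with simple pattern matching. The sketch below is illustrative only; the patterns are deliberately simplified assumptions, and production-grade PII discovery needs far more robust rules.

```python
import re

# Simplified, illustrative patterns for a few PII types; these are
# not production-grade detectors.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def find_pii(text):
    """Return {pii_type: [matches]} for each pattern found in text."""
    return {
        kind: pattern.findall(text)
        for kind, pattern in PII_PATTERNS.items()
        if pattern.findall(text)
    }

sample = "Contact jane@example.com or 555-867-5309; SSN 123-45-6789."
found = find_pii(sample)
```

Detection of this kind is the first step of data discovery and classification; the flagged fields are then candidates for the anonymization techniques described below.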
Data anonymization is carried out by most industries that deal with sensitive information, especially the healthcare, financial, and digital media industries. These techniques are intended to reduce the risk of unintended disclosure when sharing data between countries, industries, and even departments within the same company. The anonymization of data can be done in various ways, including deletion, encryption, or generalization. Often these techniques are used in concert to better anonymize the data.
Data anonymization techniques
Data masking is the practice of hiding data with altered values. This can include creating a mirror version of a database and applying modification techniques such as character shuffling, encryption, and word or character substitution, such as replacing a value character with a symbol. Data masking is intended to make reverse engineering or detection of the data impossible.
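Two of the masking operations mentioned above, character substitution and character shuffling, can be sketched minimally in Python. The helpers, field names, and sample record are hypothetical, chosen only to illustrate the idea.

```python
import random

def substitute(value, symbol="*", keep_last=4):
    """Character substitution: replace all but the last few
    characters of a value with a mask symbol."""
    masked_len = max(len(value) - keep_last, 0)
    return symbol * masked_len + value[masked_len:]

def shuffle_chars(value, seed=None):
    """Character shuffling: randomly permute the characters of a
    value (seeded here only for reproducibility)."""
    chars = list(value)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

record = {"name": "Jane Doe", "card": "4111111111111111"}
masked = {
    "name": shuffle_chars(record["name"], seed=42),
    "card": substitute(record["card"]),  # -> "************1111"
}
```

Note that shuffling alone preserves the character multiset, which is one reason masking techniques are usually combined rather than used in isolation.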
Data perturbation seeks to modify the original dataset by applying techniques that round numbers and add random noise. The range of values needs to be in proportion to the perturbation: a small base leads to weak anonymization, while a large base reduces the dataset's utility. For example, multiplying house numbers by fifteen can retain their value, while using that same base for age values can expose the technique.
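A minimal sketch of this rounding-plus-noise approach, with a hypothetical `perturb` helper and fabricated age values, might look as follows; the base and noise parameters are the knobs the paragraph above describes.

```python
import random

def perturb(values, base, noise=0.0, seed=None):
    """Round each value to the nearest multiple of `base`, then add
    uniform random noise drawn from [-noise, +noise]."""
    rng = random.Random(seed)
    return [round(v / base) * base + rng.uniform(-noise, noise)
            for v in values]

ages = [23, 37, 41, 58]

# A base of 5 coarsens ages to the nearest multiple of 5.
coarse = perturb(ages, base=5)          # -> [25, 35, 40, 60]

# Adding bounded noise further obscures exact values.
noisy = perturb(ages, base=5, noise=2.0, seed=0)
```

A larger base (say, 15) would anonymize more strongly but make the age column far less useful for analysis, which is the utility trade-off noted above.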
Data swapping, also known as shuffling or permutation, rearranges the dataset's attribute values so they do not correspond with the original records. Swapping attributes that contain identifier values may have more impact on anonymization than swapping membership-type values.
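This column-level shuffling can be sketched as follows; the `swap_column` helper and the sample records are hypothetical, and the seed is fixed only to make the example reproducible.

```python
import random

def swap_column(records, column, seed=None):
    """Permute one attribute's values across records so they no
    longer line up with their original rows."""
    values = [r[column] for r in records]
    random.Random(seed).shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]

people = [
    {"zip": "02139", "salary": 65000},
    {"zip": "10001", "salary": 72000},
    {"zip": "94105", "salary": 80000},
]
swapped = swap_column(people, "salary", seed=1)
```

The column's overall distribution is preserved, so aggregate statistics on salaries remain valid even though no individual row is accurate.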
Generalization seeks to remove some of the data to make it less identifiable. Data can be modified into a set of ranges or a broad area with appropriate boundaries. This can mean removing specifics, such as the house number in an address, without removing all the details, in order to maintain some of the data's accuracy.
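Both generalization moves described above, binning a value into a range and dropping the most specific part of an address, can be sketched with two small hypothetical helpers; the bin width and address format are illustrative assumptions.

```python
def generalize_age(age, width=10):
    """Replace an exact age with a range of the given width."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_address(address):
    """Drop a leading house number, keeping the street name to
    preserve partial accuracy."""
    parts = address.split(" ", 1)
    if len(parts) == 2 and parts[0].isdigit():
        return parts[1]
    return address

age_range = generalize_age(37)                    # -> "30-39"
street = generalize_address("221 Baker Street")   # -> "Baker Street"
```

Widening the range or dropping more address components increases anonymity at the cost of accuracy, the same trade-off that governs perturbation.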
Pseudonymization is a data management and de-identification method which replaces private identifiers with fake identifiers or pseudonyms. This is intended to preserve statistical accuracy and data integrity while allowing the data to be used for training, development, testing, and analytics without risking the individual's privacy.
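One common way to implement pseudonymization is with a keyed hash, so that the same identifier always maps to the same pseudonym and only a holder of the key could re-link values. The sketch below uses HMAC-SHA256 from Python's standard library; the key, field names, and records are hypothetical placeholders.

```python
import hmac
import hashlib

# Hypothetical key; in practice this would be a managed secret.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(identifier, key=SECRET_KEY):
    """Replace an identifier with a keyed-hash (HMAC-SHA256)
    pseudonym, truncated for readability."""
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()[:16]

records = [
    {"email": "jane@example.com", "visits": 4},
    {"email": "jane@example.com", "visits": 2},
]
pseudo = [{**r, "email": pseudonymize(r["email"])} for r in records]
# Both rows receive the same pseudonym, so per-user aggregation
# (e.g., total visits) still works on the pseudonymized data.
```

Because the mapping is consistent, statistical accuracy and joins are preserved for analytics and testing, while the raw identifier never appears in the working dataset.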
Generally, it is considered a best practice in data anonymization to use multiple layers of defense. Especially in the case of big data analytics, one layer of anonymization is generally insufficient, and multiple layers of protection are required to stop de-anonymization attacks. Some of these include:
- Database activity monitoring provides real-time alerts on policy violations in big data sets, data warehouses, mainframes, and relational databases.
- A database firewall evaluates known vulnerabilities and blocks SQL injections.
- Data discovery determines where data resides, and data classification identifies the quantity and context of data on-premises and in the cloud.
- Data loss prevention software detects potential data breaches by inspecting sensitive information while in use, in motion, and at rest.
- Data masking renders sensitive data useless in the wrong hands.
- User behavior analytics uses machine learning to establish a baseline for data access behavior and detect abnormal activity.
- A user rights management feature monitors data access and privileged user activity, and identifies inappropriate privileges.
There are obvious benefits attached to data anonymization methods. They increase user security and, for many industries, can increase user or customer satisfaction. However, other benefits can be identified and are in some cases emphasized to ensure compliance. These include:
- Protects against possible loss of market share and trust - where the potential loss of sensitive, personal, or confidential data can damage consumer and market trust.
- Safeguards against data misuse and exploitation risks - where data anonymization can ensure regulatory compliance by safeguarding against data misuse and potential insider exploitation risks.
- Increases governance and consistency - data anonymization can also increase the governance and consistency of results; where clean, accurate data can allow users to leverage apps and services and preserve big data analytics and privacy.
On the other side, disadvantages of data anonymization and regulatory compliance can include the limits placed on the information a company can collect. For example, a company that wants to gather personal information such as cookies, IP addresses, and computer IDs is required to receive permission from users, which can restrict the amount of meaningful information it can extract from the results. In this example, the anonymized information also cannot be used for targeting purposes or for personalizing the user experience. Anonymized data can also be less coherent or meaningful, which further reduces the potential insights derived from a dataset.