Techniques promoting the anonymization or pseudonymization of sensitive data have flourished in recent years. However, the exact meaning of these words remains unclear to the vast majority of people, including decision and policy makers. In particular, pseudonymization is commonly seen as a more sophisticated form of anonymization, while it actually provides much weaker guarantees! Let's clarify things a bit, through proper definitions and simple examples.
The GDPR provides helpful guidance for distinguishing the two: it defines pseudonymization as "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information". The last part of the sentence is crucial, because it marks the major difference from anonymization, which acts in such a manner that the data subject is not or no longer identifiable.
In particular, this means that pseudonymized data can be re-identified with the help of extra information. This has two major implications. First, if identities in records are replaced by random IDs but an index table that maps the IDs back to identities is kept somewhere, then the data is not anonymized but merely pseudonymized. Second, given records where the most obvious identity markers have been removed, if any public data exists that makes it possible to re-identify some records, then the dataset is only pseudonymized as well.
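To make the first implication concrete, here is a minimal sketch (with made-up records) of what such a pseudonymization step looks like: names are swapped for random IDs, but the lookup table keeping the correspondence is exactly the "additional information" that makes re-identification possible.

```python
import secrets

# Hypothetical records for illustration only.
records = [
    {"name": "Alice", "diagnosis": "flu"},
    {"name": "Bob", "diagnosis": "asthma"},
]

# Maps pseudonym -> real identity; typically stored separately, under access control.
lookup = {}
pseudonymized = []
for record in records:
    pseudonym = secrets.token_hex(8)
    lookup[pseudonym] = record["name"]
    pseudonymized.append({"id": pseudonym, "diagnosis": record["diagnosis"]})

# As long as `lookup` exists, this data is pseudonymized, NOT anonymized:
# anyone holding the table can map every ID back to a name.
```

Deleting the lookup table does not automatically make the data anonymous either, because of the second implication: external data may still allow re-identification.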
This means that, in practice, it is hard to prove that a dataset is properly anonymized, and some companies have learned this the hard way by disclosing poorly processed records.
The main stumbling block is not direct identifiers such as names, social security numbers or ID photographs, which can easily be removed or randomized. The real issue lies with indirect identifiers, such as social connections, medical history or even the language spoken, which, taken alone, are not strong identifiers but which, combined together, can describe a very narrow population.
In 2006 for example, Netflix published a movie rating dataset covering over 500,000 users. The data was cleaned up and direct identifiers such as names were removed. However, a significant part of this dataset was re-identified by researchers who compared the sets of ratings and rating dates against public IMDb records. Because the way people rate movies is very personal and unique, this re-identification was surprisingly successful. In particular, the researchers showed that with 8 movie ratings (of which 2 may be completely wrong) and dates known within a 14-day error, 99% of records could be uniquely identified in the dataset. The implications of such a disclosure can be incredibly serious: movie ratings can reveal a lot about someone's sexual and political orientation, and Netflix was actually taken to court for privacy invasion. The main takeaway of this work is that identifiers can be hard to spot and that only a few attributes may be needed to de-anonymize a dataset.
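The core idea of such a linkage attack can be illustrated with a toy sketch (entirely hypothetical data, nothing from the actual study): a "de-identified" record is matched against public profiles by comparing ratings, tolerating a few mismatches, just as the researchers tolerated wrong ratings and fuzzy dates.

```python
# One record from a hypothetical "anonymized" release: movie -> rating.
anonymous_record = {"Heat": 5, "Se7en": 4, "Fargo": 5, "Alien": 3}

# Hypothetical public profiles (think: IMDb reviews under real usernames).
public_profiles = {
    "imdb_user_1": {"Heat": 5, "Se7en": 4, "Fargo": 5, "Alien": 2},
    "imdb_user_2": {"Heat": 2, "Se7en": 1, "Fargo": 3, "Alien": 5},
}

def similarity(anon, public):
    """Fraction of the anonymous ratings matched exactly in a public profile."""
    return sum(public.get(movie) == rating for movie, rating in anon.items()) / len(anon)

# The profile with the highest overlap is the likely identity behind the record.
best = max(public_profiles, key=lambda u: similarity(anonymous_record, public_profiles[u]))
# imdb_user_1 matches 3 of 4 ratings, so the "anonymous" record links to that user
# despite one rating being different.
```

With only a handful of movies rated, very few people share the same combination of titles and scores, which is why so few attributes sufficed in the real study.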
There have been many other scandals, such as the disclosure of New York City taxi drivers' personal details through the publication of anonymous taxi rides in 2014, or the re-identification of patients in an open dataset of Australian medical billing records published in 2016, which still had repercussions recently.
While some failures are due to carelessness during data cleaning, even careful processing may not be enough to guarantee robustness against re-identification, as the researchers showed when studying the Netflix dataset. This naturally leads to the following question: when can one state that a dataset is correctly anonymized, given that it is impossible to know all the data available elsewhere?
Several techniques exist that provide some privacy guarantees, but each of them has its weaknesses.
For example, k-anonymity provides robustness against individual re-identification by ensuring that every attribute configuration describes a population of at least k individuals. Hence you cannot distinguish a person from at least k-1 others using the features in the dataset. This is achieved by suppressing direct identifiers and generalizing indirect ones. For example, if records contain the age of patients, it might be replaced by a decade range (e.g. 20 < age ≤ 30). However, k-anonymity is vulnerable to some attacks (like the so-called homogeneity attacks) and also assumes you have access to all the data in order to generalize attributes without losing too much precision, which may not be possible for real-time data. In addition, it is not clear how it applies to some data formats, such as free text.
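The generalization step and the k-anonymity property can be sketched in a few lines (toy records, and a simplistic generalization scheme chosen for illustration): ages are coarsened to decade ranges, zip codes are truncated, and we then check that every resulting combination of quasi-identifiers covers at least k records.

```python
from collections import Counter

# Hypothetical records: age and zip code act as quasi-identifiers.
records = [
    {"age": 23, "zip": "75011"}, {"age": 27, "zip": "75011"},
    {"age": 24, "zip": "75011"}, {"age": 35, "zip": "69002"},
    {"age": 38, "zip": "69002"},
]

def generalize(record):
    """Coarsen quasi-identifiers: age 23 -> '20 < age <= 30', zip 75011 -> '75***'."""
    decade = (record["age"] - 1) // 10 * 10
    return (f"{decade} < age <= {decade + 10}", record["zip"][:2] + "***")

def is_k_anonymous(records, k):
    """True if every generalized attribute combination covers >= k records."""
    counts = Counter(generalize(r) for r in records)
    return all(count >= k for count in counts.values())

is_k_anonymous(records, 2)  # True: both equivalence classes contain at least 2 rows
is_k_anonymous(records, 3)  # False: the '69***' group only contains 2 rows
```

A homogeneity attack would still work here if, say, everyone in one equivalence class shared the same diagnosis: knowing someone is in that class reveals their diagnosis even though no single record can be singled out.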
To address some of the limitations of k-anonymity, refinements such as l-diversity and many others have been proposed. But at the end of the day, all these techniques have inherent limitations and are not applicable to all types of data.
To put it simply: real anonymization is hard.
So, should we give up on anonymization? Hmm, not really.
First, if you know the weaknesses of each de-identification technique, you can use them successfully for specific use cases and obtain sufficient privacy guarantees. Second, with the advancement of machine learning, especially in healthcare and banking, new techniques have been developed that may help enforce privacy by shifting the privacy question from the data itself to the model and analyses performed on it. In particular, instead of adding noise or generalization directly to the data, these methods operate on the query or model being applied to it, which is less destructive to meaningful information while ensuring stronger privacy. This family of techniques goes by the name of Differential Privacy, and it is a very dynamic area of research.
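As a taste of what "adding noise to the query rather than the data" means, here is a minimal sketch of the Laplace mechanism, the classic building block of differential privacy, applied to a counting query over hypothetical data:

```python
import random

# Hypothetical raw data: the individual records stay untouched.
ages = [23, 27, 24, 35, 38, 41, 29]

def private_count(data, predicate, epsilon):
    """Noisy count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person changes
    the answer by at most 1), so Laplace noise with scale 1/epsilon yields
    epsilon-differential privacy. The difference of two Exp(1) draws is a
    convenient way to sample Laplace(0, 1) with the standard library.
    """
    true_count = sum(1 for x in data if predicate(x))
    noise = (random.expovariate(1.0) - random.expovariate(1.0)) / epsilon
    return true_count + noise

private_count(ages, lambda a: a < 30, epsilon=1.0)  # true answer is 4, plus noise
```

The analyst only ever sees noisy answers, so no single individual's presence in the data can be confidently inferred; the smaller the epsilon, the noisier the answer and the stronger the privacy guarantee.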
If you want to hear more about Differential Privacy and other privacy-preserving topics, stay tuned and follow us on Twitter to be notified first when the next blog post is out!