Sensitive issue: how to handle sensitive personal data under the GDPR
"Prior to GDPR, if you got hit with a fine and you were one of the bigger processors, it was a rounding error; it would barely pay for the Christmas party. Now, you've got fines that are close to a billion euros," Ross McKean, chair of DLA Piper's UK data protection and security group, told CNBC earlier this year. That's no exaggeration. According to new research by the law firm, fines for violations of the EU's privacy law have increased sevenfold in the past year. In total, the bloc's data protection authorities have handed out fines to the tune of $1.25 billion since last January, up from about $180 million a year earlier.
"GDPR has certainly been effective in making everyone sit up and listen to data protection law and enforcement," McKean concludes. That's without a doubt – but advances in technology and, more specifically, data analytics make it harder and harder to cut through the noise.
Insurance companies, for example, are now developing insurability scores and models based on vast aggregates of publicly and privately available data, explains information governance expert George Tziahans. These datasets make up some of the most expansive views of an individual's habits, practices, and personal information but also add a whole new level of information security-related risks. As Tziahans points out: "All this data creates value in developing risk models and serving customers. But it also generates a tremendous amount of highly sensitive private information."
With the fourth anniversary of the GDPR's entry into force fast approaching, let's revisit two fundamental concepts of the regulation: sensitive personal data and personal data. We'll look at some clear-cut examples of sensitive personal data, some that are more open to interpretation, and how to lawfully handle special category data.
What is sensitive personal data – and just as importantly, what is it not?
Sensitive personal data is a special category of personal data that reveals a person's racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, or that concerns their health, sex life, or sexual orientation, as well as genetic and biometric data. It enjoys extra consideration and protection under the GDPR, as its misuse may give rise to stigmatization or discrimination in society. Many people think that the GDPR addresses only sensitive personal information. Under the GDPR's definition, however, personal data is any information that can be attributed to a specific individual, regardless of the nature of that information.
Nevertheless, the distinction between sensitive and non-sensitive data can be just as blurred as between personal and non-personal data.
For instance, recent research claims that it is possible to predict sexual orientation from facial photos with 81% accuracy. So, should facial photos be considered sensitive? Technically, it depends on the confidence of the prediction. Is electricity consumption sensitive? It can be, as consumption is often shaped by religious practice (e.g., observant Jewish households use less electricity on the Sabbath). Is a video of a person sensitive? It can be, as one can compute a pulse rate from subtle variations in skin color using sophisticated image processing techniques, and an abnormal pulse rate can be a precursor of many diseases.
According to the GDPR, all this data reveals information about a person's health, sex life, or even religion. Hence it should be considered sensitive.
Sensitive data vs. personal data vs. confidential data: what's the difference?
In the GDPR, personal data is defined as any information related to an identified or identifiable natural person. Think name and surname; home address; ID card number; the location data on a mobile phone; IP address; cookie ID; advertising identifier, and so on. For example, if a medical dataset contains the patients' names, hometowns, and medical diagnoses, then a record (or "row") within this dataset is personal data if the patient the record is about can be re-identified, meaning that anybody with access to the dataset is able to associate the record with the patient.
But the first rule of GDPR data classification is: not everything is what it seems to be.
| Rec# | Name | Hometown | Diagnosis |
|------|------|----------|-----------|
| 1 | Jerry Bilbo | LA | Meningitis |
| 2 | Tom Sawyer | NO | Prostate cancer |
| 3 | Carl Schwartz | LA | Bronchitis |
| 4 | John Smith | NYC | Alzheimer |
| … | … | … | … |

Table 1. Perhaps non-personal data
| Rec# | Hometown | Diagnosis |
|------|----------|-----------|
| 1 | LA | Meningitis |
| 2 | LA | Prostate cancer |
| 3 | NO | Bronchitis |
| 4 | NYC | Alzheimer |
| … | … | … |

Table 2. Perhaps personal data
At first sight, Table 1 contains personal data due to the names stored in every record. However, this is not always the case, as there can be several natural persons with the same name in a population (e.g., in a city) but not all of them included in the dataset. For example, if there are several persons named John Smith in NYC, then Record #4 may not be personal data if attackers accessing the dataset cannot single out the flesh-and-blood individual in the population whom Record #4 is about.
On the other hand, a dataset that does not contain personal names can still contain personal data. For example, consider an attacker who has access to Table 2 as well as Table 3, where the latter contains demographic data about people including the patients in Table 2. The attacker can see that there are three individuals named John Smith in NYC, who are aged 26, 35, and 65, respectively. Therefore, Record #4 in Table 2 must belong to Record #3 in Table 3, since Alzheimer's is very rare at ages 26 and 35.
This means that the attacker has probably re-identified a person named John Smith in Table 2, assuming that Table 3 contains all individuals named John Smith from NYC. John's name, hometown, and date of birth together form an identifier that allows his unambiguous re-identification in Table 2.
| Rec# | Name | SSN | Hometown | Date of birth |
|------|------|-----|----------|---------------|
| 1 | Susan Smith | 2346758913 | NO | 12/12/1965 |
| 2 | John Smith | 4545454323 | NYC | 01/03/1991 |
| 3 | John Smith | 8375835937 | NYC | 08/05/1952 |
| 4 | John Smith | 3548469234 | NYC | 28/11/1982 |
| 5 | Ursula Mayden | 3484756773 | LA | 30/11/1954 |
| … | … | … | … | … |

Table 3. All people in a city
The following question immediately arises: if the second attacker can re-identify the person behind Record #4 of Table 2, but the first attacker cannot do so without Table 3, is this record personal data according to the GDPR? The answer depends on the plausibility of the second attack. If it is plausible that the second attacker can access Table 2 alongside Table 3, then Record #4, and hence Table 2, is regarded as personal data of John Smith, who is identified by his name, hometown, and date of birth. A sketch of this linkage attack follows below.
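To make the attack concrete, here is a minimal Python sketch of the linkage described above, using the toy data from Tables 2 and 3. The age threshold for Alzheimer's is an illustrative assumption, not a clinical rule:

```python
# Toy linkage attack: join a "de-identified" medical table (Table 2)
# with a public demographic table (Table 3) using background knowledge.

# Table 2: medical records without names
medical = [
    {"rec": 1, "hometown": "LA",  "diagnosis": "Meningitis"},
    {"rec": 2, "hometown": "LA",  "diagnosis": "Prostate cancer"},
    {"rec": 3, "hometown": "NO",  "diagnosis": "Bronchitis"},
    {"rec": 4, "hometown": "NYC", "diagnosis": "Alzheimer"},
]

# Table 3: public demographic data (name, hometown, birth year)
population = [
    {"name": "Susan Smith",   "hometown": "NO",  "born": 1965},
    {"name": "John Smith",    "hometown": "NYC", "born": 1991},
    {"name": "John Smith",    "hometown": "NYC", "born": 1952},
    {"name": "John Smith",    "hometown": "NYC", "born": 1982},
    {"name": "Ursula Mayden", "hometown": "LA",  "born": 1954},
]

def plausible_ages(diagnosis):
    # Background knowledge: Alzheimer's overwhelmingly affects people over ~60.
    return (60, 120) if diagnosis == "Alzheimer" else (0, 120)

def link(record, population, year=2017):
    # Keep only population members whose hometown matches and whose age
    # makes the diagnosis plausible.
    lo, hi = plausible_ages(record["diagnosis"])
    return [p for p in population
            if p["hometown"] == record["hometown"]
            and lo <= year - p["born"] <= hi]

candidates = link(medical[3], population)  # Record #4: NYC, Alzheimer
if len(candidates) == 1:
    print("Re-identified:", candidates[0])  # John Smith, born 1952
```

Only one John Smith in NYC is old enough to plausibly have Alzheimer's, so the single match re-identifies him.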
Sensitive personal data is a subset of personal data, covering special categories of personal data that are subject to specific processing conditions and must be handled with extra security. Examples of sensitive data types include:
- trade union membership;
- biometric data, such as face, voice, palm, retina, or ear-shape recognition;
- health data, such as medical history or fitness tracker information;
- genetic data, such as DNA and RNA;
- any information revealing someone's racial or ethnic origin, political opinions, religious or philosophical beliefs, or sexual orientation.
This brings us to the topic of confidential data and, more specifically, how it differs from sensitive data.
Confidential data is a broad categorization of any information of commercial value whose disclosure, alteration, or loss could cause substantial harm to the competitive position of the data holder. Consequently, confidential data includes much more than personal data. In any organization, contracts, accounting information, business processes, and even source code should be viewed as confidential data. Confidentiality is attributed mainly based on the potential harm the data could cause to its holder should it get into the wrong hands.
Complex and unstructured data: it's complicated
Circling back to the dilemma of data aggregation and its implications for sensitive data protection, it's important to note that in the age of big data, most datasets merge a myriad of user attributes, from what people have bought and where, to which card they paid with. The more detailed the record, the more re-identifiable the person.
For example, consider a dataset of credit card transactions that records the day and location of each payment, and suppose it includes four transactions by John. Even if John's name, address, and account/card number are removed from the dataset (i.e., it is "pseudonymized"), an attacker who learns the approximate location and day of those four transactions can very likely find John's record, as it is most likely the only one in the dataset containing all four.
So, is it possible that an attacker could get to John's record this way? Very much so. Research has shown that four transactions are enough to uniquely re-identify 90% of individuals in a dataset of 1.1 million people. But that's not all. Another study found that the 6-8 movies a person has watched can also serve as an identifier in a dataset of 500,000 Netflix users. For re-identification, data records can be linked to IMDb, where people often review movies under their real names.
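The metric behind these studies is often called "unicity": the fraction of individuals uniquely pinned down by a handful of known data points. Here is a small, self-contained simulation on synthetic data; the dataset size and parameters are made up for illustration:

```python
# Unicity on synthetic data: how often do k known (location, day) points
# single out exactly one record in a pseudonymized transaction dataset?
import random

random.seed(0)

# 1,000 synthetic users, each with ~12 (location, day) transaction points
users = {u: {(random.randrange(50), random.randrange(30)) for _ in range(12)}
         for u in range(1000)}

def unicity(users, k=4, trials=200):
    unique = 0
    for _ in range(trials):
        uid = random.choice(list(users))
        # The attacker's side knowledge: k points from the victim's history
        known = set(random.sample(sorted(users[uid]), k))
        # Records consistent with that knowledge
        matches = [u for u, pts in users.items() if known <= pts]
        unique += (len(matches) == 1)
    return unique / trials

print(f"Singled out by 4 known points: {unicity(users):.0%}")
```

Even in this toy setting, a few points almost always isolate a single record, which is exactly why pseudonymization alone is not enough.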
But let's not forget that structured data is nothing but one piece of the puzzle in the data picture.
As opposed to its structured counterpart, unstructured data isn't, or can't be, classified and organized in a predefined way; think emails, customer reviews, blog posts, social media posts, or video, audio, and log files. For example, an article headline that reads "A person sells his Mustang in Innsbruck" can be personal data, especially if there is only one person in the whole of Innsbruck with a Mustang. So can the source code of a piece of software, even without direct authorship information, as coding style is often unique to a developer.
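Discovering personal data in unstructured text is its own challenge. As a toy illustration (real data-discovery tools rely on trained named-entity recognition models and far richer pattern sets), a few regular expressions can flag obvious identifier candidates:

```python
# Naive scan of unstructured text for potential personal data.
import re

PATTERNS = {
    "email":  r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "us_ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "phone":  r"\+?\d{1,3}[ -]?\d{3}[ -]?\d{3,4}[ -]?\d{3,4}",
}

def scan(text: str) -> dict:
    # Return every match per category; anything found needs human review.
    return {label: re.findall(rx, text) for label, rx in PATTERNS.items()}

print(scan("Contact John at john.smith@example.com or +1 212 555 0100."))
```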
Legal conditions for processing sensitive personal data
According to the European Commission, organizations can only process sensitive data if one of the following conditions is met:
- The explicit consent of the individual was obtained;
- An organization is required by law to process the data;
- The vital interests of the individual are at stake;
- They're a not-for-profit body processing data about members;
- The personal data was made public by the individual;
- The data is needed to resolve a legal claim;
- The data is processed for substantial public interest, for the purposes of preventive or occupational medicine, for reasons of public health interest, or for archiving, scientific, historical, or statistical purposes.
More details on the GDPR conditions for processing can be found here.
To get a common misconception out of the way: gaining consent is not the be-all and end-all of GDPR-compliant data processing. In fact, it should be your last resort, used only when no other legal basis for data handling exists. For example, you might need someone's personal data to fulfill a contract, as is often the case. If you're processing this data based on your customer's consent, you'll be unable to perform your contractual obligations the minute their consent is withdrawn.
Technical controls to keep sensitive data as safe as possible
1. Minimize the risk of data exposure with end-to-end encryption
The GDPR recommends encryption to secure data against exposure. However, not all encryption solutions provide the same protection in case your files get into the wrong hands. For the strongest protection, encryption keys should be controlled by the end-user, and they should not be accessible to the service provider at any point in the encryption and decryption process.
With Tresorit's end-to-end encryption, the encryption keys that unlock your data are stored on your device. Unlike solutions that encrypt data only in transit or at rest, even those with key management modules, we never encrypt or decrypt your data on our servers. Tresorit can never access the personal data stored in your files; only you and those you share them with can.
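To illustrate the principle (this is a generic sketch using the open-source Python cryptography package, not Tresorit's actual protocol), client-side encryption looks roughly like this:

```python
# Client-side encryption sketch: the key is generated and kept on the
# user's device, so the server only ever sees ciphertext.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # stays on the client, never uploaded
f = Fernet(key)

ciphertext = f.encrypt(b"patient: John Smith, diagnosis: Alzheimer")
# ... only `ciphertext` is sent to the server ...

plaintext = f.decrypt(ciphertext)  # possible only where the key lives
```

If the provider never holds the key, a breach of its servers exposes only ciphertext.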
2. Use pseudonymization and anonymization
Pseudonymization means replacing or removing information in a data set that identifies an individual, so that the personal data can no longer be attributed to a specific data subject without additional information. It's a great technique for adding a further layer of security to your data assets, but that's all it is: pseudonymized personal data is still personal data.
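A minimal sketch of one common pseudonymization approach, keyed hashing (HMAC), where the secret key is the "additional information" that must be stored separately from the data:

```python
# Pseudonymization via keyed hashing: the same person always maps to the
# same pseudonym, and whoever holds the key can re-link pseudonyms to
# people, which is why the result is still personal data under the GDPR.
import hashlib
import hmac

SECRET_KEY = b"keep-me-in-a-separate-vault"  # the "additional information"

def pseudonymize(identifier: str) -> str:
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"name": "John Smith", "hometown": "NYC", "diagnosis": "Alzheimer"}
record["name"] = pseudonymize(record["name"])
print(record)  # the name is replaced by a stable, key-dependent pseudonym
```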
Anonymized data, on the other hand, isn't. Anonymization is an irreversible process that makes it impossible to identify the data subjects, directly or indirectly. Tools for anonymizing personal data include one-way hashing and encryption techniques where the decryption key is discarded. Bear in mind, however, that anonymization also devalues data.
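Alongside key-discarding, another common anonymization route is generalization and aggregation: coarsen the identifying attributes and release only counts. The toy sketch below also shows why anonymization devalues data; only coarse statistics survive:

```python
# Anonymization by generalization + aggregation: individual rows are
# replaced by counts over coarsened attributes, at the cost of detail.
from collections import Counter

records = [("NYC", 1952, "Flu"), ("NYC", 1955, "Flu"),
           ("LA", 1965, "Flu"), ("LA", 1954, "Bronchitis")]

def generalize(hometown, born, diagnosis):
    decade = born // 10 * 10          # 1952 -> 1950
    return (hometown, f"{decade}s", diagnosis)

aggregate = Counter(generalize(*r) for r in records)
# Suppress groups that are too small to hide an individual
released = {g: c for g, c in aggregate.items() if c >= 2}
print(released)  # only coarse group counts remain; individuals do not
```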
3. Have a bulletproof identity and access management strategy in place
The GDPR requires that only those who need to work with personal data should have access to it. With Tresorit's permission settings, you can ensure that personal data is shared only with those who need it for their job, and that everyone on your team is on the same page when it comes to using crucial data security tools.
A central dashboard for controlling file management, Tresorit's Admin Center helps you keep an eye on what happens to files containing personal data within your company, set up and enforce security policies, and manage company-owned devices. Password protection, download limits, and expiry dates provide further protection for confidential documents shared within or outside your organization.
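In spirit, a least-privilege permission model reduces to an explicit allow-list per resource. A hypothetical sketch (not Tresorit's actual API; the file names and users are invented):

```python
# Minimal least-privilege check: access to files containing personal data
# is granted only to users with an explicit business need.
ACL = {
    "hr/salaries.xlsx":    {"alice"},           # need-to-know only
    "marketing/plan.docx": {"alice", "bob"},
}

def can_access(user: str, path: str) -> bool:
    return user in ACL.get(path, set())  # default: deny

assert can_access("alice", "hr/salaries.xlsx")
assert not can_access("bob", "hr/salaries.xlsx")  # denied: no business need
```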
About the author
Gergely Acs is an Assistant Professor at the Budapest University of Technology and Economics (BUTE), Hungary, and member of the Laboratory of Cryptography and System Security (CrySyS). Before joining CrySyS Lab, Gergely was a research scholar and engineer at INRIA, France. His research focuses on different aspects of data privacy and security including privacy-preserving machine learning, data anonymization, and data protection impact assessment (DPIA). He received his M.S. and Ph.D. degrees from BUTE.
This post is an updated and expanded version of the original, which was published in 2017. It was last updated on June 15, 2023, by the Tresorit Team.