Ever since I stumbled upon the Open Data Institute (ODI), as I mentioned last week, I’ve been following what they do and reading their blog, research papers, opinions etc. I’m even thinking about whether I should subscribe as an individual member so that I get to be part of their solution for building an open, trustworthy data ecosystem. Data is a subject close to my heart, as you can read in my other related posts here and here (if you have not and are new to my blog!).
Anyway, ODI recently blogged about anonymisation and synthetic data, which are techniques for removing identifying details from personal data so that it can be shared, either openly or in a closed manner with third parties, for a good use that needs to be defined by data stewards. But the question I still have today is: from a legal perspective, is anonymised/hashed data still considered personal data? If you ask me, the answer is no, because you can’t simply unearth the person’s name or information from the anonymised data.
Here’s a sample of how raw data is transformed into anonymised data. Say, for example, I have access to the number of hours logged on Instagram by a set of people.
Raw data:
| Name ID | Date of Birth | Number of hours logged on Instagram per day |
Anonymised data:
| Name ID | Age-Range | Number of hours logged on Instagram per day |
Now tell me, from the anonymised table above, how can I tell whether Melissa D spent 3.2 hours a day on Instagram? I can’t, so it shouldn’t be classified as personal, and hence the data can be shared. But I can’t overrule the law, can I? Based on what I’ve read so far, the answer is still very vague, but it’s not as rigid as we thought.
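For the more hands-on readers, the transformation from the raw table to the anonymised one can be sketched in a few lines of Python. This is only a rough illustration, not ODI's method: the function names, the `P001`-style IDs, the 10-year age bands and Melissa's date of birth are all my own assumptions.

```python
from datetime import date


def age_range(date_of_birth: date, today: date, width: int = 10) -> str:
    """Generalise an exact date of birth into a coarse age band, e.g. '20-29'."""
    # Subtract one year if the birthday has not happened yet this year.
    age = today.year - date_of_birth.year - (
        (today.month, today.day) < (date_of_birth.month, date_of_birth.day)
    )
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymise(rows, today):
    """Swap names for sequential IDs and dates of birth for age ranges."""
    out = []
    for i, row in enumerate(rows, start=1):
        out.append({
            "name_id": f"P{i:03d}",  # hypothetical ID scheme
            "age_range": age_range(row["date_of_birth"], today),
            "hours_per_day": row["hours_per_day"],
        })
    return out


# Hypothetical raw row; the date of birth is made up for illustration.
raw = [{"name": "Melissa D", "date_of_birth": date(1990, 5, 17), "hours_per_day": 3.2}]
print(anonymise(raw, today=date(2020, 1, 1)))
```

The name never makes it into the output, and the exact birthday is reduced to a band, which is the whole point of the anonymised table.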
Below is the definition given by the most stringent law on personal data, which I extracted from ODI’s report on Anonymisation and Open Data:
The EU General Data Protection Regulation (GDPR) defines personal data as:
any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person
The UK’s Office for National Statistics defines private information as: information that
- relates to an identifiable legal or natural person, and
- is not in the public domain or common knowledge, and
The key word here is identifiable. If the person is not identifiable, then the data should no longer be deemed personal data; at least that’s how I would interpret it.
Why does this question matter, and why does the answer need to be transparent? So that corporations can share more data for value creation purposes. And so that we don’t need to spend hours arguing with our traditional lawyers haha.
Jokes aside, it matters so that we can focus on the real work rather than on the risk we carry by processing, sharing and using the data. It matters because you can adopt all kinds of techniques or tools to reduce the risk of personal data re-identification, but if the law remains vague, it’s difficult for companies with a traditional mindset to innovate and harness the power of big data in this so-called data-hungry world.
ODI’s report on Anonymisation and Open Data also highlighted that synthetic data, which is created by an automated process such that it holds similar statistical patterns to an original dataset, can contain no personal data even though it is based on a dataset that holds personal data. The automated process is done by a machine learning method called deep learning, a method which has gained fame in recent years and is used by some of the big players in the US and China (and, as you may know, the method which can make or break self-driving cars). If this holds true, then I would have the answer to my question of whether anonymised data is personal data.
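To make the idea of "similar statistical patterns" a bit more concrete, here is a toy sketch. Note that it does not use deep learning as described in the report: I swapped in a much simpler technique, fitting a normal distribution to the original hours and sampling brand new values from it. The numbers in `original_hours` and the `synthesise` function are hypothetical, made up purely for illustration.

```python
import random
import statistics


def synthesise(hours, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to the originals."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    mu = statistics.mean(hours)
    sigma = statistics.stdev(hours)
    # Clamp each draw to a plausible 0-24 hours per day and round to one decimal.
    return [round(min(24.0, max(0.0, rng.gauss(mu, sigma))), 1) for _ in range(n)]


# Hypothetical original values (hours logged on Instagram per day).
original_hours = [3.2, 1.5, 4.0, 2.7, 0.8, 5.1]
synthetic_hours = synthesise(original_hours, n=1000)

# The two averages should land close to each other,
# even though no synthetic value belongs to a real person.
print(round(statistics.mean(original_hours), 2))
print(round(statistics.mean(synthetic_hours), 2))
```

None of the synthetic rows corresponds to an actual person in the original set, yet the overall pattern (here, just the average and spread) is roughly preserved. A real synthetic data generator would need to capture far richer patterns than this.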