Personal Data in India: anonymise and forget

By Sohini Chatterjee

Recent developments have indicated the government’s plan to intensify existing efforts towards an open data ecosystem. Most recently, the Economic Survey 2018-19 has considered the establishment of a central welfare database of citizens to integrate fragmented data about all Indians. For example, such a database may consolidate a citizen’s land title records, tax records and medical records which are currently maintained by separate government agencies.

Previously, the NITI Aayog’s National Strategy for AI has also emphasised data sharing through the convergence of siloed public datasets. These proposals aim at making it easier to harvest and analyse large amounts of data, and hinge on the social benefits of free flow of information.

A data convergence of this nature poses significant privacy and security concerns. In response to this, data anonymisation has been presented as a silver bullet that can protect privacy while enabling broad dissemination of data. The Economic Survey has assured us that data on the open government data portal is “completely anonymised and aggregated”. Statements by the IT minister echo a similar faith in data anonymisation as a privacy safeguard.

This approach views anonymisation as a tool to balance conflicting interests in open data versus individual privacy protection. However, anonymisation as a privacy panacea merits scrutiny.

What is data anonymisation?

Anonymisation refers to the various processes and techniques aimed at making it impossible, or at least very difficult, to identify a particular individual from collected information related to them. This is typically done by removing personally identifiable information, such as name, Aadhaar number, e-mail address and photograph, from data stored by a data fiduciary or a third party. For instance, a bank that wishes to disclose customer data to a third party may anonymise it by removing names, account numbers and Aadhaar numbers. The anonymised dataset may still retain the dates of birth, sex and PIN codes of its customers. Data anonymisation is regarded as a global best practice in data protection and is widely believed to enhance privacy.
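The bank example can be made concrete with a short sketch. It is illustrative only: the field names and customer records below are hypothetical, and it shows only the simplest form of anonymisation, which is deleting direct identifiers while leaving everything else in place.

```python
# Minimal sketch of "anonymisation by removal", assuming a hypothetical
# bank schema. Field names and records are invented for illustration.
DIRECT_IDENTIFIERS = {"name", "account_number", "aadhaar_number"}


def strip_direct_identifiers(record):
    """Return a copy of the record without the direct identifiers."""
    return {key: value for key, value in record.items()
            if key not in DIRECT_IDENTIFIERS}


customers = [
    {"name": "Customer One", "account_number": "0001",
     "aadhaar_number": "XXXX-XXXX-0001", "date_of_birth": "1984-03-12",
     "sex": "F", "pin_code": "110001", "balance": 52000},
    {"name": "Customer Two", "account_number": "0002",
     "aadhaar_number": "XXXX-XXXX-0002", "date_of_birth": "1990-07-30",
     "sex": "M", "pin_code": "700019", "balance": 18000},
]

# The dataset the bank would hand to the third party.
released = [strip_direct_identifiers(c) for c in customers]
```

Each released record still carries a date of birth, sex and PIN code. Those retained fields are what later make re-identification possible.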

The Srikrishna Committee’s draft Personal Data Protection Bill, 2018 has defined anonymisation as the irreversible process of transforming personal data to a form in which a data principal cannot be identified. The draft bill itself does not contain the standards of anonymisation, and it has been left to the Data Protection Authority (‘DPA’) to specify the same. It is worth noting that anonymised data is exempt from the draft Indian law, as well as other data protection laws like the EU GDPR. In light of this, the bar of what constitutes ‘anonymisation’ assumes significance.

Constraints of data anonymisation

At the outset, the Indian government may have overlooked a fundamental tension between complete anonymisation and the utility of a dataset. The Article 29 Working Party has observed the difficulty in creating a perfectly anonymous dataset that simultaneously retains underlying information useful for a certain task. In fact, on studying the inverse relationship between utility and privacy, Prof. Paul Ohm has argued that data can be either useful or perfectly anonymous but never both. For example, a perfectly anonymised dataset for a labour study may just contain data on wages, without providing any relevant insight on gender or ethnic discrimination, or on salary gaps between individuals living in different regions.

Moreover, the assumption that anonymisation is robust neglects the residual risks that endure even in anonymised data. The mere removal of directly identifiable information does not mean that identification is no longer possible, as anonymisation techniques have deficiencies and may rapidly become obsolete. The possibility of an individual being indirectly identified through singling out, linkability or inference therefore persists.

Studies have shown that easy and accurate re-identification is possible even with incomplete datasets, or with datasets devoid of personally identifiable information. This may be done through machine learning, or by linking anonymised data with external information such as electoral rolls, advocates' rolls and CBSE results.

In other words, some anonymisation techniques would only yield desirable results if there were no other data known about the individual. This is unlikely in the present age of pervasive data collection. For example, anonymised abortion data released by the government may be linked to publicly available data like an electoral roll to identify an individual. A reliable connection is enough to pose a privacy risk. Lessons learnt earlier in the AOL public data release and the Netflix prize competition should serve as reminders.
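A linkage attack of this kind needs very little machinery. The sketch below is a hedged illustration, not a description of any real incident: the released health records, the public register, and the assumption that both share date of birth, sex and PIN code are all hypothetical.

```python
from collections import defaultdict

# Fields assumed (hypothetically) to appear in both datasets.
QUASI_IDENTIFIERS = ("date_of_birth", "sex", "pin_code")


def link(released_rows, public_register):
    """Return (name, row) pairs where exactly one person in the public
    register shares the row's quasi-identifiers."""
    index = defaultdict(list)
    for person in public_register:
        key = tuple(person[q] for q in QUASI_IDENTIFIERS)
        index[key].append(person["name"])

    matches = []
    for row in released_rows:
        key = tuple(row[q] for q in QUASI_IDENTIFIERS)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # a unique match re-identifies the row
            matches.append((candidates[0], row))
    return matches


# Invented "anonymised" health records: no names, but quasi-identifiers remain.
released_rows = [
    {"date_of_birth": "1984-03-12", "sex": "F", "pin_code": "110001",
     "diagnosis": "condition A"},
    {"date_of_birth": "1990-07-30", "sex": "M", "pin_code": "700019",
     "diagnosis": "condition B"},
]

# Invented public register standing in for something like an electoral roll.
public_register = [
    {"name": "Voter One", "date_of_birth": "1984-03-12", "sex": "F",
     "pin_code": "110001"},
    {"name": "Voter Two", "date_of_birth": "1990-07-30", "sex": "M",
     "pin_code": "700019"},
    {"name": "Voter Three", "date_of_birth": "1990-07-30", "sex": "M",
     "pin_code": "700019"},
]

matches = link(released_rows, public_register)
```

In this toy data the first record matches exactly one person in the register and is re-identified, diagnosis included. The second matches two people and stays ambiguous, which shows that the risk depends on how unique a combination of attributes is, not on whether names were removed.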

Harms

Anonymised data is usually not subject to data protection regulations and can be freely shared with third parties like advertising companies and data brokers. The Economic Survey itself contemplates selling anonymised datasets to the private sector for boosting data analytics. Further processing of poorly anonymised data may reveal sensitive personal data about individuals, and even allow the creation of comprehensive profiles about individuals.

This could have personal, professional and financial implications for an individual thus re-identified. It could also pose a risk to public order or democracy when specific categories of sensitive data, including political opinions, sexual preferences or health information about individuals, are openly disclosed. The release of anonymised data could also lead to the identification of, and potential discrimination against, a group of people. Such instances may fall within a grey area of the law, where identification could be a privacy risk without amounting to a data protection risk.

Policy responses

Policymakers in India should be aware of the constraints of data anonymisation. Instead of perceiving it as a privacy panacea for open data, they should adopt a cautious attitude that seeks to protect privacy effectively. Without underestimating the risks of re-identification, the DPA should formulate adequate norms on what constitutes genuinely anonymous data, and update them routinely to account for technological changes that render existing anonymisation techniques obsolete.

Nonetheless, this is not an easy task. The Srikrishna Committee in its report has advised against laying down a prescriptive standard as to what constitutes anonymisation. Therefore – instead of a “release and forget” approach – a contextual approach to carefully design the application of anonymisation techniques to a particular case is necessary. Such efforts should be based on the principle that different types of anonymised data pose varying degrees of re-identification risks.

True anonymisation may require additional efforts to prevent future identification, keeping in mind the context and purpose(s) for which the anonymised data will be processed. It should ensure that no one can identify an individual in a dataset, whether by singling them out, linking data or inferring information. Additional measures worth considering include regulatory scrutiny of public or semi-public data releases, risk assessments that weigh the benefits of open data against possible privacy harms, re-identification testing, and having a trusted third party anonymise personal data.
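One simple form of re-identification testing is to count how many records share each combination of indirectly identifying attributes before a dataset is released. The sketch below uses k-anonymity, a measure this article does not itself discuss, purely as an illustration. The records, the choice of attributes and the generalisation rules are all hypothetical.

```python
from collections import Counter

# Attributes assumed (hypothetically) to be available to an attacker.
QUASI_IDENTIFIERS = ("date_of_birth", "sex", "pin_code")


def smallest_group_size(rows, quasi_identifiers=QUASI_IDENTIFIERS):
    """Size of the smallest group of rows sharing the same quasi-identifier
    values. A result of 1 means at least one person can be singled out."""
    if not rows:
        return 0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())


def generalise(row):
    """Coarsen a row: keep only the birth year and the PIN code prefix."""
    coarse = dict(row)
    coarse["date_of_birth"] = row["date_of_birth"][:4]
    coarse["pin_code"] = row["pin_code"][:3] + "XXX"
    return coarse


# Invented records for illustration.
rows = [
    {"date_of_birth": "1984-03-12", "sex": "F", "pin_code": "110001"},
    {"date_of_birth": "1984-11-02", "sex": "F", "pin_code": "110045"},
    {"date_of_birth": "1990-07-30", "sex": "M", "pin_code": "700019"},
    {"date_of_birth": "1990-01-15", "sex": "M", "pin_code": "700027"},
]

k_before = smallest_group_size(rows)                          # every row is unique
k_after = smallest_group_size([generalise(r) for r in rows])  # rows now share groups
```

Coarsening the data raises the smallest group from one record to two, but at the cost of exact dates and localities. This is the trade-off between utility and anonymity described earlier, and a check of this kind addresses only singling out, not linkage to richer external data or inference.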

It is unwise for the open data movement to treat data anonymisation as a one-off exercise that permanently erases privacy concerns. A refined approach to anonymisation will energise India’s recent strides in privacy and personal data protection.

This article was prepared by the author in her personal capacity. All views expressed are personal and do not represent the views of any entity whatsoever with which the author has been, is, or will be affiliated.