They who must not be identified—distinguishing personal from non-personal data under the GDPR

Michèle Finck, Frank Pallas, They who must not be identified—distinguishing personal from non-personal data under the GDPR, International Data Privacy Law, Volume 10, Issue 1, February 2020, Pages 11–36, https://doi.org/10.1093/idpl/ipz026

Key Points

One year after the European Union’s (EU) General Data Protection Regulation (hereafter ‘GDPR’ or ‘the Regulation’) became binding, uncertainty continues to surround the definition of some of its core concepts, including that of personal data. Drawing a dividing line between personal data and non-personal data is, however, paramount to determine the scope of application of European data protection law. Whereas personal (including ‘pseudonymous’) data is subject to the Regulation, non-personal data is not. Determining whether a given data item qualifies as personal data is thus crucial, and increasingly burdensome as more data are being generated and shared.

Notwithstanding the pivotal importance of the distinction between personal and non-personal data, it can, in practice, be extremely burdensome to differentiate between both categories. This difficulty is anchored in both technical and legal factors. From a technical perspective, the increasing availability of data points as well as the continuing sophistication of data analysis algorithms and performant hardware makes it easier to link datasets and infer personal information from ostensibly non-personal data. From a legal perspective, it is at present not obvious which legal test should be applied to categorize data under the GDPR.

Recital 26 GDPR announces that data is anonymous where means 'reasonably likely' to be used do not allow it to be linked to an identified or identifiable natural person. National supervisory authorities and the Article 29 Working Party (the 'A29WP', which is now the European Data Protection Board—'EDPB') have, however, provided interpretations of the concept that conflict with this legislative text. It will indeed be seen below that whereas Recital 26 GDPR embodies a test based on the risk of identification, the Working Party has developed a parallel test under which no remaining risk of identification can be tolerated for data to qualify as anonymous data. Notwithstanding, anonymization is an important concept from the perspective of other notions and requirements in European data protection law, such as that of data minimization. The difficult determination of what constitutes a 'reasonable likelihood' of identification further burdens practitioners' work. Beyond, the explicit inclusion of the new concept of pseudonymous data in the Regulation has confused some observers.

This article charts the resulting entanglements from an interdisciplinary perspective. It evaluates the GDPR’s definition of personal and non-personal data from a computer science and legal perspective by proceeding as follows. First, the legal concepts of personal and non-personal data are introduced through an analysis of the legislative text and its interpretation by different supervisory authorities. Secondly, we introduce the technical perspective on modifying personal data to remove person-relatedness. The third section applies the preceding insights in looking at practical examples of blockchain use cases. A concluding section thereafter builds on previous insights in engaging with the risk-management nature of the GDPR. It will be seen that, contrary to what has been maintained by some, perfect anonymization is impossible and that the legal definition thereof needs to embrace the remaining risk.

The legal definition of personal data under the GDPR

The GDPR only applies to personal data, meaning that non-personal data falls outside its scope of application. The definition of personal data is hence an element of primordial significance as it determines whether an entity processing data is subject to the various obligations that the Regulation imposes on data controllers. This underlines that the definition of personal data is far from merely being of theoretical interest. Rather, the contours of the concepts of personal and non-personal data are of central practical significance to almost anyone processing data. Notwithstanding, ‘[w]hat constitutes personal data is one of the central causes of doubt’ in the current data protection regime. 1

The Regulation adopts a binary approach that differentiates between personal data and non-personal data and subjects only the former to its scope of application. 2 In contrast with this binary legal perspective, reality operates on a spectrum between data that is clearly personal, data that is clearly anonymous, and everything in between. 3 Today, much economic value is derived from data that is not personal on its face but can be rendered personal if sufficient effort is invested. Beyond, there is an ongoing debate as to whether, and if so under which circumstances, personal data can be manipulated to become anonymous. Indeed, whereas some data can be anonymous data from the beginning (such as climatic sensor data with no link to natural persons), other data may at some point be personal data but then be successfully manipulated to no longer relate to an identified or identifiable natural person. This underscores that the classification of personal data is dynamic. Depending on context, the same data point can be personal or non-personal and hence be subject to the Regulation or not.

This section introduces three causes of the blurred definition of personal data. First, there is doubt regarding the appropriate legal test to be applied. Secondly, technical developments are further complicating this definitional exercise. Thirdly, the introduction of an explicit legal category of 'pseudonymous' data in the GDPR has induced confusion.

Personal data

Article 4(1) GDPR defines personal data as:

any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person. 4

Personal data is hence data that directly or indirectly relates to an identified or identifiable natural person. The Article 29 Working Party has issued guidance on how the four constituent elements of the test in Article 4(1) GDPR—‘any information’, ‘relating to’, ‘an identified or identifiable’, and ‘natural person’—ought to be interpreted. 5

Information ought to be construed broadly, and includes objective information (such as a name or the presence of a given substance in one's blood) as well as subjective information such as opinions and assessments. 6 The European Court of Justice (ECJ) has, however, clarified in the meantime that whereas information contained in an application for a residence permit, and the applicant's data contained in a related legal analysis, qualify as personal data, the legal analysis itself (the assessment) does not. 7 Personal data can moreover take any form, and be alphabetical or numerical data, videos, and pictures. 8 Note, moreover, that Article 4(1) GDPR refers to 'information' rather than just data, indicating that the data appears to require some informational value. Of course, the distinction between information and data is not always easy to draw.

Data is considered to be relating to a data subject 'when it is about that individual'. 9 This includes information that is in an individual's file but also vehicle data that reveals information about the data subject. 10 An individual is considered to be identified or identifiable where they can be 'distinguished' from others. 11 This does not require that the individual can be identified by name; rather, she could also be identified through alternative means such as a telephone number. 12 This underlines that the concept of personal data ought to be interpreted broadly, a stance that has been embraced by the Court time and time again. It held in Nowak that the expression 'any information' reflects 'the aim of the EU legislature to assign a wide scope to that concept, which is not restricted to information that is sensitive or private, but potentially encompasses all kinds of information, not only objective but also subjective'. 13 In Digital Rights Ireland, the ECJ established that metadata (such as location data or IP addresses combined with log files on retrieved web pages) which only allows for the indirect identification of the data subject can nonetheless be personal data as it makes it possible 'to know the identity of the person with whom a subscriber or registered user has communicated and by what means, and to identify the time of the communication as well as the place from which that communication took place'. 14 Finally, Article 4(1) GDPR underlines that personal data must relate to a natural person. The GDPR does not apply to legal persons or the deceased. 15

The above overview has underlined that the concept of personal data ought to be interpreted broadly. Yet, not all data constitute personal data.

Differentiating between personal data and non-personal data

The European data protection framework acknowledges two categories of data: personal and non-personal data. There is data that is always non-personal (because it never related to an identified or identifiable natural person) and there is also data that once was personal but no longer is (as linkage to a natural person has been removed). The legal test to differentiate between personal and non-personal data is embodied in Recital 26 GDPR according to which:

[p]ersonal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person. To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.

Data not caught by this test falls outside the scope of European data protection law. Indeed, Recital 26 GDPR goes on to state that:

The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.

Pursuant to the GDPR, data is hence personal when the controller or another person is able to identify the data subject by using the 'means reasonably likely to be used'. Where data never related to a natural person, or where it is no longer reasonably likely that it can be attributed to a natural person, it qualifies as 'anonymous information' and falls outside the Regulation's scope of application. Figure 1 depicts the test to be applied to determine whether information constitutes personal data.

The test devised by Recital 26 GDPR essentially embraces a risk-based approach to qualify information. 16 Where there is a reasonable risk of identification, data ought to be treated as personal data. Where that risk is merely negligible, data can be treated as non-personal data, and this even though identification cannot be excluded with absolute certainty. A closer look reveals, however, that some of the elements of this test suffer from a lack of clarity, resulting in particular from contrasting interpretations by various supervisory authorities.

Figure 1. Assessment scheme for person-relatedness of data under the GDPR.

Making sense of the various elements of Recital 26 GDPR

Although Recital 26 GDPR appears to embrace a straightforward approach to distinguish between personal and non-personal data, in practice, it has often proven difficult to implement. This becomes obvious when dividing the overall test embodied in the GDPR into its various constituent elements.

What risk? The uncertain standard of identifiability

Recital 26 GDPR formulates a risk-based approach to determine whether data is personal in nature or not. Where identification is 'reasonably likely' to occur, personal data is in play; where this is not the case, the information in question is non-personal. Some national supervisory authorities have embraced interpretations of the GDPR that largely appear in line with this risk-based approach. The UK Information Commissioner's Office (ICO), for instance, adopts a relativist understanding of Recital 26 GDPR, stressing that the relevant criterion is 'the identification or likely identification' of a data subject. 17 It acknowledges that 'the risk of re-identification through data linkage is essentially unpredictable because it can never be assessed with certainty what data is already available or what data may be released in the future'. 18 The Irish Data Protection Authority (DPA) deems that it is not 'necessary to prove that it is impossible for the data subject to be identified in order for an anonymisation technique to be successful. Rather, if it can be shown that it is unlikely that a data subject will be identified given the circumstances of the individual case and the state of technology, the data can be considered anonymous'. 19

In its 2014 guidelines on anonymization and pseudonymization, the Article 29 Working Party, however, adopted a different stance. On the one hand, the Working Party acknowledges the Regulation's risk-based approach. 20 On the other hand, it appears to devise its own independent zero-risk test. Its guidelines announce that 'anonymisation results from processing personal data in order to irreversibly prevent identification'. 21 Similarly, the guidance document announces that 'anonymisation is a technique applied to personal data in order to achieve irreversible de-identification'. 22 This strict position is in line with earlier guidance from 2007 according to which anonymized data is data 'that previously referred to an identifiable person, but where that identification is no longer possible'. 23 This means that 'the outcome of anonymisation as a technique applied to personal data should be, in the current state of technology, as permanent as erasure, i.e. making it impossible to process personal data'. 24

What is more, the Working Party considers that 'when a data controller does not delete the original (identifiable) data at event-level, and the data controller hands over part of this dataset (for example after removal or masking of identifiable data), the resulting dataset is still personal data'. 25 This has been criticized as there may well be scenarios where a controller wants to release anonymous data while needing to keep the original dataset, as would be the case where a hospital makes available anonymized data for research purposes while retaining the original data for patient care. 26 This position in itself is a rejection of the risk-based approach as it considers the risk stemming from keeping the initial data to be intolerable. As Stalla-Bourdillon and Knight explain, the combination of the A29WP's emphasis on the original dataset and the wording of Recital 26 GDPR is 'problematic since as the definition of pseudonymisation refers to both identified and identifiable data subjects the risk remains that data will be considered pseudonymised as long as the raw dataset has not been destroyed, even if the route of anonymisation through aggregation has been chosen'. 27 Beyond, the opinion also uses expressions that are difficult to make sense of, such as 'identification has become reasonably impossible'—although it is unclear what reasonably (a qualified term) impossible (an absolute term) could mean. 28

Compared to the risk-based approach of the GDPR, the Working Party thus appears to consider that no amount of risk can be tolerated. Indeed, the concepts of irreversibility, permanence, and impossibility stand for a much stricter approach than that formulated by the legislative text itself. Whereas Recital 26 acknowledges that anonymization can never be absolute (such as where technology changes over time), the Working Party’s absolutist stance indicates that anonymization ought to be permanent. These diverging interpretations have prevented legal certainty as to what test ought to be applied in practice. 29

The tension between the A29WP's no-risk stance and the risk-based approach embraced by Recital 26 GDPR can also be identified in guidance released by national authorities. To the French Commission Nationale de l'Informatique et des Libertés (CNIL), anonymization consists in making 'identification practically impossible'. It deems that anonymization 'seeks to be irreversible' so as to no longer permit the processing of personal data. 30 This reference to practical impossibility is more helpful as it clarifies that impossibility is the goal while recognizing that it can be difficult to achieve in practice. To the French Conseil d'État, the highest national administrative court, data can, however, only be considered anonymous if the direct or indirect identification of the person becomes 'impossible', and this notwithstanding whether evaluated from the perspective of the data controller or a third person. 31 In contrast, the highest French civil court, the Cour de cassation, concluded that IP addresses are personal data without justifying why this is the case. 32 The Finnish Social Science Data Archive similarly considers that for data to count as anonymous, 'anonymisation must be irreversible'. 33

What elements ought to be taken into account to determine whether anonymization has occurred?

Pursuant to Recital 26 GDPR, the relevant criterion to assess whether data is pseudonymous or anonymous is identifiability. 34 To determine whether an individual can be identified consideration ought to be given to ‘all means reasonably likely to be used’. This includes ‘all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments’. 35

In addition, the A29WP considers that three criteria ought to be considered to determine whether de-identification has occurred, namely whether (i) it is still possible to single out an individual; (ii) it is still possible to link records relating to an individual; and (iii) information concerning an individual can still be inferred. 36 Where the answer to these three questions is negative, data can be considered anonymous. It should be noted that while Recital 26 GDPR now makes explicit reference to 'singling out', inference and linkability are elements considered by the Working Party but not explicitly mentioned in the GDPR.

Singling out refers to ‘the possibility to isolate some or all records which identify an individual in the dataset’. 37 Linkability denotes the risk generated where at least two data sets contain information about the same data subject. If in such circumstances an ‘attacker can establish (e.g. by means of correlation analysis) that two records are assigned to a same group of individuals but cannot single out individuals in this group’, then the used technique only provides resistance against singling out but not against linkability. 38 Finally, inference may still be possible even where singling out and linkability are not. Inference has been defined by the Working Party as ‘the possibility to deduce, with significant probability, the value of an attribute from the values of a set of other attributes’. 39 The Working Party underlined that meeting these three thresholds is very difficult. 40 This is confirmed by its own analysis of the most commonly used ‘anonymisation’ techniques, which revealed that each method leaves a residual risk of identification so that, if at all, only a combination of different approaches can successfully de-personalize data. 41
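To make these three criteria concrete, the following Python sketch (with entirely hypothetical data, not drawn from the Working Party's opinion) illustrates what singling out, linkability, and inference can look like on a small 'anonymised' release:

```python
# Illustrative sketch (hypothetical data): the three re-identification risks
# discussed by the A29WP, shown on a small 'anonymised' hospital release.

released = [  # direct identifiers removed, quasi-identifiers retained
    {"zip": "10115", "age": 34, "diagnosis": "flu"},
    {"zip": "10115", "age": 71, "diagnosis": "cancer"},
    {"zip": "20095", "age": 34, "diagnosis": "flu"},
]

# (i) Singling out: a combination of attributes isolates exactly one record.
matches = [r for r in released if r["zip"] == "10115" and r["age"] == 71]
assert len(matches) == 1  # one record is isolated

# (ii) Linkability: a second dataset sharing quasi-identifiers can be joined
# with the release, linking records about the same individual.
members = [{"name": "Alice Example", "zip": "10115", "age": 71}]
linked = [(m["name"], r["diagnosis"])
          for m in members for r in released
          if (m["zip"], m["age"]) == (r["zip"], r["age"])]
print(linked)  # [('Alice Example', 'cancer')]

# (iii) Inference: attribute values can be deduced even without isolating a
# record -- here every 34-year-old in the release shares one diagnosis.
inferred = {r["diagnosis"] for r in released if r["age"] == 34}
assert inferred == {"flu"}  # diagnosis inferred from age alone
```

On the Working Party's criteria, a release only approaches anonymity where all three of these attacks fail.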

What is the relevant time scale?

Recital 26 requires that the ‘means’ to be taken into account are not just those that are presently available, but also ‘technological developments’. It is, however, far from obvious what timescale ought to be considered in this respect. Recital 26 does not reveal whether one ought to account for ongoing technological changes (such as a new technique that has been rolled out across many sectors but not yet to the specific data controller or processor) or whether developments currently just explored in research should also be given consideration. To provide a concrete example, it is not obvious whether the still uncertain prospect of quantum computing should be factored in when determining whether a certain encryption technique could transform personal data into anonymous data. 42

The A29WP indicated that one should consider both 'the state of the art in technology at the time of the processing' as well as 'the possibilities for development during the period for which the data will be processed'. With respect to the second scenario, the lifetime of the data is a key factor. Indeed, where data is to be kept for a decade, the data controller 'should consider that possibility of identification that may occur also within the ninth year of their lifetime, and which may make them personal data at that moment'. 43 This indicates that the data in question only becomes personal information in the ninth year, yet from the beginning the controller must be aware of, and prepare for, that possibility. This highlights the GDPR's nature as a risk-management framework, which is further explored in the concluding section.

Pursuant to the Working Party, the relevant system 'should be able to adapt to these developments as they happen, and to incorporate the appropriate technical and organisational measures in due course'. 44 Indeed, the assumption is that data becomes personal data at the moment identification becomes possible. The relevant question appears to be whether a given dataset can be matched with other datasets from the perspective of availability rather than technical possibility. The risk that the entity in possession of the dataset may in the future acquire (access to) additional information that, in combination, may enable identification is accordingly not considered to legally qualify data in the present. This has been criticized on the basis that 'the characterisation of anonymised data should also be dependent upon an ongoing monitoring on the part of the initial data controller of the data environment of the dataset that has undergone anonymisation'. 45 In fact, at a time when data generation continues to accelerate, an entity may have access to a dataset that on its face is anonymous but might then, purposefully or not, subsequently gain access to a dataset containing information that enables re-identification. The resulting data protection risks are considerable. Yet, it is questionable how a test addressing this ex ante could be fashioned as there is often little way of predicting what data may be generated or acquired in the future. As such it might be more realistic to acknowledge data's dynamic nature and that anonymous data becomes personal data as soon as identification becomes possible. In any event, data controllers have a monitoring obligation and must adopt technical and organizational measures in due course.

Personal data to whom?

To determine whether information constitutes personal data, it is important to know from whose perspective the likelihood of identification ought to be assessed. Recital 26 provides that to determine identifiability ‘account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly’. This formulation appears to indicate that it is not sufficient to evaluate identifiability from the perspective of the controller but potentially also any other third party.

The GDPR is a fundamental rights framework and the ECJ has time and time again emphasized the need to provide an interpretation thereof capable of ensuring the complete and effective protection of data subjects. From this perspective, it matters little from whose perspective data qualifies as personal data—anyone should protect the data subject’s rights. In the academic literature, there has long been a debate as to whether there is a need to only focus on the data controller (a relative approach) or any third party (an absolute approach). 46 Some have criticized the absolute approach, highlighting that the reference to ‘another person’ eliminates ‘the need for any risk management because it compels the data controller to always make the worst possible assumptions even if they are not relevant to the specific context’. 47

Some supervisory authorities appear to have embraced a half-way test between the absolute and relative approach. The ICO formulated the ‘motivated intruder’ test whereby companies should determine whether an intruder could achieve re-identification if motivated to attempt this. 48 The motivated intruder is assumed to be ‘reasonably competent’ and with access to resources such as the Internet, libraries, or all public documents but should not be assumed to have specialist knowledge such as hacking skills or to have access to ‘specialist equipment’. 49

In the European courts, Breyer is the leading case on this matter. 50 Mr Breyer had accessed several websites of the German federal government that stored information regarding access operations in logfiles. 51 This information included the visitor's dynamic IP address (an IP address that changes with every new connection to the Internet). Breyer argued that the storage of the IP address was in violation of his rights. The ECJ had already decided in Scarlet Extended that static IP addresses 'are protected personal data because they allow those users to be precisely identified'. 52 Recital 30 GDPR also considers IP addresses to be online identifiers.

The Court noted that in Scarlet Extended the collection and identification of IP addresses had been carried out by the Internet Service Provider (ISP), whereas in the case at issue the IP address was collected by the German federal government, which 'registers IP addresses of the users of a website that it makes accessible to the public, without having the additional data necessary in order to identify those users'. 53 A dynamic IP address is not data relating to an identified natural person but can be considered to make log entries relate to an identifiable person where the necessary additional data are held by the ISP. 54 The dynamic IP address accordingly qualified as personal data even though the data needed to identify Mr Breyer was held not by the German authorities but by the ISP. 55

In isolation, this would imply that the nature of data ought not just to be evaluated from the perspective of the controller (German authorities; the relative approach) but also from the perspective of third parties (the ISP; the absolute approach). Indeed, ‘there is no requirement that all the information enabling the identification of the data subject must be in the hands of one person’. 56 However, that finding may have been warranted by the specific facts at issue. The Court stressed that whereas it is in principle prohibited under German law for the ISP to transmit such data to website operators, the government has the power to compel ISPs to do so in the event of a cyberattack. This is also an interesting statement as it implies that cyberattacks are events that are ‘reasonably likely’—arguably highlighting that the standard of reasonable likelihood to be applied is not very strict. Breyer hence confirms the risk-based approach in Recital 26 GDPR as the Court indeed evaluates the actual risk of identification.

The Breyer ruling also raises an additional question. Indeed, the Court's emphasis on the legality (for the government only) of compelling ISPs to reveal the data necessary to re-personalize a de-personalized dataset was key to its conclusion. This invites the question of whether the illegality of an act that enables identification means that such an act should always be considered reasonably unlikely.

This relativist approach to identifiability has been endorsed in other contexts as well. Some have argued in relation to cloud computing that 'to the person encrypting personal data, such as a cloud user with the decryption key, the data remain "personal data"'. 57 In the Safe Harbor agreement, the Commission considered that the transfer of key-coded data to the USA is not a personal data export where the key is not revealed or transferred alongside the data. 58 Recently, an English court embraced a cautious reading of Breyer, arguing that whereas the ECJ's ruling depended 'on specific factual aspects of the German legal system', it should not be held that the mere fact that a party can use the law to gain access to data to 'identify a natural person' would make that procedure a 'means reasonably likely to be used'. 59

The above would indicate that the perspective from which identifiability ought to be assessed is that of the data controller. In Breyer, Advocate General Campos Sánchez-Bordona warned that if the contrary perspective were adopted, it would never be possible to rule out with absolute certainty ‘that there is no third party in possession of additional data which may be combined with that information and are, therefore, capable of revealing a person’s identity’. 60

As a consequence, there is currently ‘a very significant grey area, where a data controller may believe a dataset is anonymised, but a motivated third party will still be able to identify at least some of the individuals from the information released’. 61 Research has moreover pointed out that where a data controller implements strategies to burden the re-identification of data, this does not mean that adversaries will be incapable of identifying the data, particularly since they might have a higher tolerance for inaccuracy as well as access to additional (possibly illegal) databases. 62 On the other hand, adopting an absolute approach could effectively rule out the existence of anonymous data as ultimately there will always be parties able to combine a dataset with additional information that may re-identify it.

An objective or subjective approach?

It is furthermore unclear from whose perspective the risk of identification ought to be evaluated. Recital 26 foresees that a 'reasonable' investment of time and financial resources should be considered to determine whether a specified natural person can be identified. There is, however, an argument to be made that what is 'reasonable' depends heavily on context. The characterization of data is context-dependent, so that personalization 'should not be seen as a property of the data but as a property of the environment of the data'. 63 It is indeed fair to assume that reasonableness ought to be evaluated differently depending on whether the entity concerned is a private person, a law enforcement agency, or a major online platform. Whereas a case-by-case assessment is in any event required, it is not obvious what standard of reasonableness ought to be applied, specifically whether the specific capacities of a relevant actor need to be factored in or not. Moreover, it is not clear whether an objective or subjective approach ought to be adopted. A subjective approach would require consideration of all factors within one's knowledge—specifically who has access to relevant data that enables identification. An objective approach would, however, require a broader evaluation, including who has access to information in the present and who might gain access to relevant data in the future.

The Irish DPA suggested that it should first be considered who the potential intruder might be before determining what the methods reasonably likely to be used are. Furthermore, organizations ‘should also consider the sensitivity of the personal data, as well as its value to a potential intruder or any 3rd party who may gain access to the data’. 64 Indeed, when anonymized data is shared with the world at large, there is a higher burden to ensure effective anonymization as it is virtually impossible to retract publication once it becomes apparent that identification is possible. 65 With this in mind, it should be evaluated what other data these controllers have access to (such as public registers but also data available only to a particular individual or organization). 66 This appears to imply that all (known) data controllers need to be considered to determine the person-relatedness of a dataset. Restricting this exercise to known data controllers seems reasonable as it would be impossible for any party to exclude with absolute certainty that there is not another party able to identify allegedly anonymous data.

The UK ICO similarly considers that when anonymized data is released publicly, it is not only important to determine whether it is really anonymous data from the perspective of the controller releasing the data but also whether there are third parties that are likely to use prior knowledge to facilitate re-identification (such as a doctor who knows that an anonymized case study relates to one of her patients). 67 This indicates that what needs to be accounted for is the knowledge of third parties that could reasonably be expected to attempt to identify data—that is, the subjective approach. Indeed, an absolute objective approach would present challenges as it would require much better knowledge of the wider world than a data controller typically has. A hospital releasing data that is 'anonymised' from its own perspective (such as for research purposes) cannot reasonably evaluate whether any other party in the world may have additional information allowing for identification. This is a particular challenge in open data contexts. Although those releasing the data may be confident that it is anonymous, they cannot exclude with certainty that other parties may be able to identify data subjects on the basis of additional knowledge they hold. An important open question that remains in this domain is thus from whose perspective the quality of data ought to be assessed: from the perspective of any third party, or only of those third parties reasonably likely to make use of the additional information they hold to re-personalize a de-personalized dataset. If the latter is the case, then the follow-on question becomes what parties can be considered reasonably likely to make use of such information and, moreover, whether presumed intent ought to be considered as a relevant factor here.

The purposes of data use

Finally, the A29WP stressed that when determining the nature of personal data, it is crucial to evaluate the ‘purpose pursued by the data controller in the data processing’. 68 Indeed, ‘to argue that individuals are not identifiable, where the purpose of processing is precisely to identify them, would be a sheer contradiction in terms’. 69 In the same vein, the French supervisory authority held that the accumulation of data held by Google that enables it to individualize persons is personal data as ‘the sole objective pursued by the company is to gather a maximum of details about individualised persons in an effort to boost the value of their profiles for advertising purposes’. 70 In line with this reasoning, public keys or other sorts of identifiers used to identify a natural person constitute personal data.

After having introduced the general uncertainties regarding the taxonomy of personal and anonymous data, it will now be seen that ongoing technical developments further burden the legal qualification of data.

Technical developments and the definition of personal data

With the advent of ever more performant data analysis techniques and hardware as well as the heightened availability of data points, it is becoming increasingly straightforward to relate data to natural persons. Some have observed that data protection law may become the ‘law of everything’ as in the near future all data may be personal data and thus subject to the GDPR. 71 This is so as ‘technology is rapidly moving towards perfect identifiability of information; datafication and advances in data analytics make everything (contain) information; and in increasingly ‘smart’ environments any information is likely to relate to a person in purpose or effect’. 72 The A29WP warned in the same vein that ‘anonymisation is increasingly difficult to achieve with the advance of modern computer technology and the ubiquitous availability of information’. 73

In light of these technical advancements, establishing the risk of re-identification can be difficult ‘where complex statistical methods may be used to match various pieces of anonymised data’. 74 Indeed ‘the possibility of linking several anonymised datasets to the same individual can be a precursor to identification’. 75 A particular difficulty here resides in the fact that it is often not known what datasets a given actor has access to, or might have access to in the future. The A29WP’s approach to anonymization has accordingly been criticized as ‘idealistic and impractical’. 76

Research has amply confirmed the difficulties of achieving anonymization, such as where an ‘anonymised’ profile can still be used to single out a specific individual. 77 Big data moreover facilitates the de-anonymization of data through the combination of various datasets. 78 It is accordingly often easy to identify data subjects on the basis of purportedly anonymized data. 79 Some computer scientists have even warned that the de-identification of personal data is an ‘unattainable goal’. 80 Recent research has confirmed that allegedly anonymous datasets often allow for the identification of specific natural persons as long as the dataset contains the person’s date of birth, gender, and postal code. 81

The language of anonymous data has been criticized for 'the very use of a terminology that creates the illusion of a definitive and permanent contour that clearly delineates the scope of data protection laws'. 82 No such permanent contour exists where even data that is anonymous on its face may subsequently be matched with other data points. Early examples of such re-personalization of datasets thought to be anonymous include the re-identification of publicly released health data using public voter lists 83 or the re-personalization of publicly released 'anonymous' data from a video streaming platform through inference with other data from a public online film review database. 84 More recent research suggests that 99.98 per cent of the population of a US state could be uniquely re-identified within a dataset consisting of 15 demographic factors. 85
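The mechanics of such linkage attacks are simple. As a minimal sketch (hypothetical names and values; the join mirrors the voter-list example above), an 'anonymous' release can be re-personalized by joining it with a public register on the quasi-identifier triple of date of birth, gender, and postal code:

```python
# Sketch of a linkage attack (hypothetical data): an 'anonymous' dataset is
# re-personalised by joining it with a public register on the quasi-identifier
# triple (date of birth, gender, postal code).

anonymous_release = [
    {"dob": "1984-03-02", "gender": "F", "zip": "02138", "condition": "asthma"},
    {"dob": "1990-11-17", "gender": "M", "zip": "02139", "condition": "diabetes"},
]

public_register = [  # eg a voter list
    {"name": "Jane Roe", "dob": "1984-03-02", "gender": "F", "zip": "02138"},
    {"name": "John Doe", "dob": "1971-06-30", "gender": "M", "zip": "02139"},
]

def quasi(record):
    """The quasi-identifier triple shared by both datasets."""
    return (record["dob"], record["gender"], record["zip"])

register_index = {quasi(r): r["name"] for r in public_register}

re_identified = [{"name": register_index[quasi(r)], "condition": r["condition"]}
                 for r in anonymous_release if quasi(r) in register_index]
print(re_identified)  # [{'name': 'Jane Roe', 'condition': 'asthma'}]
```

No sophisticated statistics are needed; a single lookup per record suffices once the two datasets share a sufficiently distinctive attribute combination.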

In light of the above, it might be argued that the risk-based approach to anonymization enshrined in Recital 26 GDPR is the only sensible approach to distinguishing between personal and non-personal data. Indeed, in today’s complex data ecosystems, it can never be assumed that the anonymization of data is ‘as permanent as erasure’. Data circulates and is traded, new data sets are created, and third parties may be in possession of information allowing linkage, which the original data controller has no knowledge of. There are accordingly considerable complications in drawing the boundaries between personal and non-personal data. The GDPR now recognizes that when data is modified to decrease linkability, this does not necessarily result in anonymous but rather in pseudonymous data.

The concept of pseudonymous data under the GDPR

Article 4(5) GDPR introduces pseudonymization as the

processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person. 86

The concept of pseudonymization is one of the novelties of the GDPR compared to the 1995 Data Protection Directive. There is an ongoing debate regarding the implications of Article 4(5), in particular, whether the provision gives rise to a third category of data beyond those of personal and anonymous data. A literal interpretation reveals, however, that Article 4(5) GDPR deals with a method, not an outcome of data processing. 87 Pseudonymization is the ‘processing’ of personal data in such a way that data can only be attributed to a data subject with the help of additional information. This underlines that pseudonymized data remains personal data, in line with the Working Party’s finding that ‘pseudonymisation is not a method of anonymisation. It merely reduces the linkability of a dataset with the original identity of a data subject, and is accordingly a useful security measure.’ 88 Thus pseudonymous data is still ‘explicitly and importantly, personal data, but its processing is seen as presenting less risk to data subjects, and as such is given certain privileges designed to incentivise its use’. 89 The Irish supervisory authority concurs that pseudonymization ‘should never be considered an effective means of anonymisation’. 90
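As a minimal sketch of the kind of processing Article 4(5) GDPR describes (assuming a simple lookup-table scheme, which is only one possible technique), identifiers can be replaced by random tokens while the mapping table—the 'additional information'—is kept separately; whoever holds that table can trivially re-attribute the data, which is why the output remains personal data:

```python
import secrets

# Minimal sketch of 'traditional' pseudonymisation via a lookup table (one
# possible scheme): identifiers are replaced by random tokens; the mapping is
# the 'additional information' of Article 4(5) GDPR, to be kept separately
# under technical and organisational safeguards.

mapping: dict[str, str] = {}  # identifier -> pseudonym, stored apart from the data

def pseudonymise(identifier: str) -> str:
    if identifier not in mapping:
        mapping[identifier] = secrets.token_hex(8)  # random, not derivable
    return mapping[identifier]

records = [("alice@example.com", "purchase A"), ("alice@example.com", "purchase B")]
released = [(pseudonymise(ident), event) for ident, event in records]

# The released data shows no identifiers, but anyone holding `mapping` can
# reverse the replacement -- hence pseudonymised data remains personal data.
reverse = {token: ident for ident, token in mapping.items()}
assert [(reverse[t], e) for t, e in released] == records
```

Note that repeated occurrences of the same identifier receive the same token, so records remain linkable within the release even without the mapping table.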

The GDPR explicitly encourages pseudonymization as a risk-management measure. Pseudonymization can serve as evidence of compliance with the controller's security obligation under Article 5(1)(f) GDPR and confirm that the data protection by design and by default requirements have been duly considered. 91 Recital 28 further provides that '[t]he application of pseudonymisation to personal data can reduce the risks to the data subjects concerned and help controllers and processors to meet their data-protection obligations.' 92 According to Recital 29, pseudonymization is possible 'within the same controller' when that controller has taken appropriate technical and organizational measures. It is interesting to note that Recital 29 explicitly facilitates this in order to 'create incentives to apply pseudonymisation when processing personal data'.

Pseudonymized data can, however, still be linked to natural persons. Recital 30 recalls that data subjects may be ‘associated with online identifiers provided by their devices, applications, tools and protocols, such as Internet protocol addresses, cookie identifiers or other identifiers’. 93 These enable identification when they leave traces which ‘in particular when combined with unique identifiers and other information received by the servers, may be used to create profiles of the natural persons and identify them’. 94

It is worth stressing that even though pseudonymized data may fall short of qualifying as anonymized data, it may be caught by Article 11 GDPR, pursuant to which the controller is not obliged to maintain, acquire, or process additional information to identify the data subject in order to comply with the Regulation. 95 In such scenarios, the controller does not need to comply with Articles 15–20 GDPR unless the data subject provides additional information enabling their identification for the purposes of exercising their GDPR rights. 96

There is thus ample recognition that whereas pseudonymization serves as a valuable risk-minimization approach, it falls short of being an anonymization technique. Before the revision of data protection law through the GDPR, there was some confusion regarding the legal distinction between pseudonymization and anonymization. Some supervisory authorities considered that pseudonymization can produce anonymous data. 97 It has been suggested that this confusion may be rooted in the fact that in computer science pseudonyms are understood as 'nameless' identifiers and thus not necessarily anonymous data. 98 In any event, the GDPR is now unequivocal that pseudonymized data is still personal data. Interestingly, however, the GDPR only looks towards one specific method of identifier replacement—referred to as 'traditional pseudonymisation' below—that uses additional, separately kept information to re-personalize pseudonymized data. The A29WP uses a different definition of pseudonymization, 99 only increasing the terminological confusion around 'pseudonymisation'.

Moreover, it is also worth noting that Article 4(5) GDPR may be read as considering that whenever there is additional data available (with the same controller?) that allows for the personalization of a de-personalized dataset, this always amounts to personal data. Stated otherwise, a data controller would be unable to anonymize data by separating a dataset that is de-personalized from a dataset that would enable re-personalization, even where the adoption of technical and organizational measures makes re-personalization reasonably unlikely. Given the pronounced practical relevance of that question, regulatory guidance specifying whether this is in fact the case would be helpful.

Having laid out the legal foundations for determining whether a certain piece of data is to be considered personal data or non-personal data under the GDPR, we now move on to the technical dimension of anonymization and pseudonymization.

Technical approaches to identifier replacement

Different technical approaches can be used to remove explicit links to natural persons from data. These approaches differ regarding the possibilities of re-personalization and, in particular, the additional knowledge and resources (in the form of computational power) necessary to achieve re-personalization. They also differ with regard to the linkability of single data points within a dataset or across different datasets. We therefore present different established patterns of replacing explicit identifiers in datasets. Given the importance of re-identification for the legal qualification of data, we discuss these patterns particularly with respect to the 'means reasonably likely to be used' test.

There are two different starting points for re-personalization, namely (1) re-identification starting from clear-text information, eg when we have a person’s ID and want to find all data points related to this ID from a set of de-personalized data and (2) re-identification starting from a de-personalized dataset, eg when we want to know the identities behind (all or some) data points matching certain criteria. We moreover have to distinguish between (i) identifier-based re-identification (learn the relation between a clear-text identifier and its obfuscated counterpart) and (ii) content-based re-identification (learn which person is behind an obfuscated ID based on the content—like motion profiles—linked to this obfuscated ID).

To illustrate, imagine a scenario where the transfer of goods and the respective payments among different actors are to be tracked without revealing the parties' identities. For this case, the four possible approaches to re-identification can be depicted as follows:

A. ID-based re-identification

1. Start from known person: 'Find all transactions that John Smith was involved in, based on his known ID'
2. Start from content: 'Find the persons involved in transaction X, based on known IDs of all persons to be considered'

B. Content-based re-identification

1. Start from known person: 'Find all transactions that John Smith was involved in through matching transaction data with his known bank account history'
2. Start from content: 'Find the persons involved in transaction X through matching transaction data with bank account histories of all persons to be considered'
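To illustrate case A.1, the following sketch (hypothetical identifiers) shows why a deterministic identifier replacement, such as an unsalted hash, permits ID-based re-identification starting from a known person: anyone who knows John Smith's clear-text ID can recompute his pseudonym and retrieve all of his transactions, without ever seeing a mapping table:

```python
import hashlib

# Sketch (hypothetical data): deterministic identifier replacement via an
# unsalted hash, and the resulting ID-based re-identification (case A.1).

def pseudonym(user_id: str) -> str:
    return hashlib.sha256(user_id.encode()).hexdigest()

# 'De-personalised' transaction log as it might be shared with third parties.
transactions = [
    {"from": pseudonym("john.smith"), "to": pseudonym("acme-corp"), "amount": 100},
    {"from": pseudonym("jane.roe"), "to": pseudonym("john.smith"), "amount": 50},
]

# An attacker who knows the clear-text ID simply recomputes the pseudonym and
# filters the log -- no mapping table or special access is required.
target = pseudonym("john.smith")
johns_transactions = [t for t in transactions
                      if target in (t["from"], t["to"])]
print(len(johns_transactions))  # 2: all of John Smith's transactions found
```

Content-based re-identification (row B), by contrast, needs no knowledge of the replacement scheme at all; it matches the content linked to an obfuscated ID—such as amounts and timing—against an external source like a bank account history.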