User data anonymization

Is anyone aware of good practices in anonymization/pseudonymization of user data? Question from a researcher in the OSCARS project. Thank you for your help.

This might be something for the ACE K-centre: ACE | CLARIN ERIC

Thank you, Dieter. Their helpdesk is under construction. I will contact Henk.

You should perhaps make it clear to Henk that, in this case, the researcher is dealing with textual data, not audio or video.

@Henk If you are on the forum, please let us know if you are aware of best practices in pseudonymizing user text data. Thank you.

Henk replied:
Anonymization and pseudonymization

Researchers are encouraged to anonymize the personal data in their research project as much and as soon as possible, as this effectively removes the personal data from the dataset and significantly reduces the privacy risks. If the researcher chooses to anonymize research data, then this should be detailed in the informed consent. A dataset is considered anonymous if it is no longer possible to identify any of the participants in any way. To be considered anonymous, a dataset must not contain any direct identifiers that can be linked to participants in the data, metadata, or documentation. Furthermore, the risk of identification via combinations of indirect identifiers or via combinations of different datasets must be sufficiently reduced to make re-identification effectively impossible. When a dataset is anonymized, it is not considered personal data and therefore not subject to the requirements of the GDPR.

According to the General Data Protection Regulation (2018), an individual is directly identifiable if you can identify them using nothing but the information you possess. This would for example be a name, photo, video or audio recording.

According to the General Data Protection Regulation (2018), an individual is indirectly identifiable when you are able to identify a person by using other information you hold or information you can reasonably access from another source. The General Data Protection Regulation (2018) advises that, to ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments. Using these means it may be possible to identify someone, for example through the singling out of individuals, the linking of records or the inference of information.
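To make the "singling out" risk concrete, here is a minimal Python sketch that counts how many records share each combination of indirect identifiers. The field names (`age_band`, `city`) and the sample records are invented for illustration. A group of size 1 means that combination points at exactly one participant. This is only a rough first check, not a full re-identification risk assessment, and it does not cover linking with other datasets.

```python
from collections import Counter


def group_sizes(records, indirect_identifiers):
    """Count how many records share each combination of indirect identifiers."""
    return Counter(
        tuple(record[field] for field in indirect_identifiers)
        for record in records
    )


def singled_out(records, indirect_identifiers):
    """Return the records whose combination of values is unique in the dataset."""
    sizes = group_sizes(records, indirect_identifiers)
    return [
        record
        for record in records
        if sizes[tuple(record[field] for field in indirect_identifiers)] == 1
    ]


# Hypothetical sample data, not taken from any real dataset.
records = [
    {"age_band": "20-29", "city": "Nijmegen", "text": "..."},
    {"age_band": "20-29", "city": "Nijmegen", "text": "..."},
    {"age_band": "60-69", "city": "Nijmegen", "text": "..."},
]

unique_records = singled_out(records, ["age_band", "city"])
smallest_group = min(group_sizes(records, ["age_band", "city"]).values())
```

Records that end up in very small groups are candidates for coarser categories (wider age bands, region instead of city) or removal before the dataset is shared.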

The General Data Protection Regulation (2018) describes pseudonymization as the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.

Pseudonymization, while not removing the personal data, is an alternative to anonymization by which the direct privacy risks of a data breach can be reduced. Researchers should be aware however that pseudonymization keeps the link between personal data and the participant intact, meaning that pseudonymized data are still considered personal data and thus subject to the requirements of the GDPR.
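Since the original question was about textual data, here is a minimal Python sketch of replacing known names in free text with pseudonyms. It assumes you already have a list of participant names and their pseudonyms; the function name `replace_known_names` and the placeholder format `[PERSON_1]` are my own choices for illustration. It only catches names that are on the list, so misspellings, nicknames and indirect identifiers in the text (places, employers, rare events) still need a manual check.

```python
import re


def replace_known_names(text, name_to_pseudonym):
    """Replace every listed name in `text` with its pseudonym (case-insensitive)."""
    if not name_to_pseudonym:
        return text
    # Longest names first, so "Anna de Vries" is matched before "Anna".
    names = sorted(name_to_pseudonym, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    lookup = {name.lower(): pseudo for name, pseudo in name_to_pseudonym.items()}
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


# Hypothetical mapping; in practice this would come from the key file.
mapping = {"Anna de Vries": "[PERSON_1]", "Anna": "[PERSON_2]"}
cleaned = replace_known_names("Anna de Vries met anna yesterday.", mapping)
```

The mapping itself is the key file in the sense of the definition above, so it has to be stored separately from the cleaned text.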

Whenever it is impossible to effectively anonymize or pseudonymize a dataset, for example when this would result in excessive loss of relevant information, the identifiable research data must be archived in such a manner that it is kept secure. This means that the research data must be archived or published under the conditions specified in the informed consent. Researchers must therefore plan under which conditions they intend to archive or publish the data prior to acquiring informed consent, so that they can inform and obtain consent from their participants about data archiving and publication.

Pseudonymization – basic steps

  1. In the data management plan, describe why and how you’re going to pseudonymize data, how access to the separately stored key file and the dataset is regulated and what happens to the key file and the data when the project is completed.

  2. Identify the following categories in your data:

• Data necessary for identification, to organize research or to communicate with research participants » Store these in a key file in a private channel of your Teams environment with limited access or store in an encrypted file in your workgroup folder.

• Data required for analysis » Preferably stored in the data management system that you use such as the Workgroup folders, Teams or RDR.

• Data not needed (e.g. in case of a supplied dataset) » This data should be deleted.

  3. Pseudonymize the data as quickly as possible, i.e. immediately when collecting data. If you are sent a dataset with identifiable data by another party, pseudonymize the data immediately after receiving it.

  4. Use different pseudonyms for different datasets. This prevents data from participants who feature in multiple datasets from being linked via the pseudonym.

  5. Limit access to the key file but ensure that within the organization there is always someone who can have access.
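The steps above can be sketched in Python roughly as follows. This is an illustration under my own assumptions, not the procedure from the linked page: the function name `pseudonymize`, the field names and the dataset prefixes `S1`/`S2` are hypothetical. Each participant gets a random pseudonym, the direct identifiers go into separate key rows, and the analysis rows keep only the pseudonym. Running it once per dataset produces unrelated pseudonyms for the same person, in line with the advice to use different pseudonyms per dataset.

```python
import secrets


def pseudonymize(rows, direct_fields, prefix):
    """Split rows into analysis rows (no direct identifiers) and key rows."""
    key_file = {}   # identity tuple -> pseudonym
    used = set()    # pseudonyms already handed out
    analysis_rows = []
    for row in rows:
        identity = tuple(row[field] for field in direct_fields)
        if identity not in key_file:
            pseudonym = f"{prefix}-{secrets.token_hex(4)}"
            while pseudonym in used:  # extremely unlikely, but keep them unique
                pseudonym = f"{prefix}-{secrets.token_hex(4)}"
            used.add(pseudonym)
            key_file[identity] = pseudonym
        cleaned = {k: v for k, v in row.items() if k not in direct_fields}
        cleaned["pseudonym"] = key_file[identity]
        analysis_rows.append(cleaned)
    key_rows = [
        dict(zip(direct_fields, identity), pseudonym=pseudonym)
        for identity, pseudonym in key_file.items()
    ]
    return analysis_rows, key_rows


# Hypothetical survey rows, invented for this example.
rows = [
    {"name": "A. Jansen", "email": "a@example.org", "answer": "yes"},
    {"name": "A. Jansen", "email": "a@example.org", "answer": "no"},
    {"name": "B. Smit", "email": "b@example.org", "answer": "yes"},
]

analysis_s1, key_s1 = pseudonymize(rows, ["name", "email"], "S1")
analysis_s2, key_s2 = pseudonymize(rows, ["name", "email"], "S2")
```

The key rows would then be written to the access-restricted, encrypted location from step 2, and the analysis rows to the regular data management system. Random pseudonyms are used here rather than hashes of the name or e-mail address, because an unkeyed hash can be recomputed by anyone who can guess the input.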

For more information, see: https://www.ru.nl/en/staff/researchers/research-data/data-anonymisation

Obviously, anonymization is not necessary for non-experimental data not involving living participants, such as historical data about persons, locations, and events. In all cases, if a researcher intends to publish personal data, they should check whether sharing is allowed without violating privacy or Intellectual Property Rights (IPR) regulations. This is also relevant when researchers use pre-existing data archived in libraries or repositories.