Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack

  • Identifying data

    Identifier: imarina:9385341
    Authors:
    Manzanares-Salor, Benet; Sanchez, David; Lison, Pierre
    Abstract:
    The availability of textual data depicting human-centered features and behaviors is crucial for many data mining and machine learning tasks. However, data containing personal information should be anonymized prior to making them available for secondary use. A variety of text anonymization methods have been proposed in recent years, which are standardly evaluated by comparing their outputs with human-based anonymizations. The residual disclosure risk is estimated with the recall metric, which quantifies the proportion of manually annotated re-identifying terms successfully detected by the anonymization algorithm. Nevertheless, recall is not a risk metric, which leads to several drawbacks. First, it requires a unique ground truth, and this does not hold for text anonymization, where several masking choices could be equally valid to prevent re-identification. Second, it relies on human judgements, which are inherently subjective and prone to errors. Finally, the recall metric weights terms uniformly, thereby ignoring the fact that the influence of some missed terms on the disclosure risk may be much larger than that of others. To overcome these drawbacks, in this paper we propose a novel method to evaluate the disclosure risk of anonymized texts by means of an automated re-identification attack. We formalize the attack as a multi-class classification task and leverage state-of-the-art neural language models to aggregate the data sources that attackers may use to build the classifier. We illustrate the effectiveness of our method by assessing the disclosure risk of several methods for text anonymization under different attack configurations. Empirical results show substantial privacy risks for most existing anonymization methods.
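
    The contrast the abstract draws between recall and an attack-based risk estimate can be illustrated with a minimal sketch. This is not the authors' implementation: the token-overlap classifier below is a crude stand-in for the neural language models mentioned in the abstract, and every name and toy document in it (recall, reidentify, reidentification_risk, background, anonymized) is a hypothetical chosen for illustration.

```python
from __future__ import annotations

from collections import Counter


def recall(annotated_terms: set[str], masked_terms: set[str]) -> float:
    """Proportion of manually annotated re-identifying terms that were masked."""
    if not annotated_terms:
        return 1.0
    return len(annotated_terms & masked_terms) / len(annotated_terms)


def tokens(text: str) -> Counter:
    """Lowercased bag of words (assumption: stands in for a language model)."""
    return Counter(text.lower().split())


def reidentify(anonymized_doc: str, background: dict[str, str]) -> str:
    """Multi-class prediction: pick the individual whose background text
    shares the most tokens with the anonymized document."""
    doc = tokens(anonymized_doc)

    def overlap(name: str) -> int:
        # Counter & Counter keeps the minimum count of each shared token.
        return sum((doc & tokens(background[name])).values())

    return max(background, key=overlap)


def reidentification_risk(anonymized: dict[str, str],
                          background: dict[str, str]) -> float:
    """Fraction of anonymized documents linked back to the right individual."""
    hits = sum(reidentify(doc, background) == name
               for name, doc in anonymized.items())
    return hits / len(anonymized)


# Hypothetical toy data: public background knowledge and masked documents.
background = {
    "alice": "alice is a cardiologist in tarragona who runs marathons",
    "bob": "bob is a pianist in oslo who collects stamps",
}
anonymized = {
    "alice": "*** is a cardiologist in *** who runs marathons",
    "bob": "*** is a *** in *** who collects ***",
}

# Recall for the first document: 2 of 4 annotated terms were masked.
alice_recall = recall({"alice", "tarragona", "cardiologist", "marathons"},
                      {"alice", "tarragona"})
risk = reidentification_risk(anonymized, background)
print(f"recall={alice_recall:.2f} risk={risk:.2f}")
```

    In this toy setting the second document has most of its terms masked, yet a single surviving term ("collects") is enough to link it to the right individual, which is the kind of non-uniform influence of missed terms that a recall score does not capture.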
  • Other:

    Author according to the article: Manzanares-Salor, Benet; Sanchez, David; Lison, Pierre
    Department: Enginyeria Informàtica i Matemàtiques
    URV author(s): Manzanares Salor, Benet / Sánchez Ruenes, David
    Keywords: Text anonymization; Record linkage; Re-identification risk; Privacy-preserving data publishing; Privac; Language models; Language model; De-identification
    Subject areas: Information systems; Engenharias iv; Engenharias iii; Computer science, information systems; Computer science, artificial intelligence; Computer science applications; Computer networks and communications; Ciências biológicas i; Ciência da computação
    License of use: https://creativecommons.org/licenses/by/3.0/es/
    Author email address: benet.manzanares@urv.cat; david.sanchez@urv.cat
    Author identifier: 0000-0001-7275-7887
    Record creation date: 2025-03-15
    Deposited article version: info:eu-repo/semantics/publishedVersion
    License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
    Article reference according to the original source: Data Mining and Knowledge Discovery. 38 (6): 4040-4075
    Item reference according to APA style: Manzanares-Salor, Benet; Sanchez, David; Lison, Pierre (2024). Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack. Data Mining and Knowledge Discovery, 38(6), 4040-4075. DOI: 10.1007/s10618-024-01066-3
    Institution: Universitat Rovira i Virgili
    Journal publication year: 2024
    Publication type: Journal Publications
  • Keywords:

    Computer Networks and Communications; Computer Science Applications; Computer Science, Artificial Intelligence; Computer Science, Information Systems; Information Systems
    Text anonymization
    Record linkage
    Re-identification risk
    Privacy-preserving data publishing
    Privac
    Language models
    Language model
    De-identification
    Information systems
    Engenharias iv
    Engenharias iii
    Computer science, information systems
    Computer science, artificial intelligence
    Computer science applications
    Computer networks and communications
    Ciências biológicas i
    Ciência da computação