The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization

  • Identifying data

    Identifier: imarina:9287297
    Authors:
    Pilan, Ildiko; Lison, Pierre; Øvrelid, Lilja; Papadopoulou, Anthi; Sanchez, David; Batet, Montserrat
    Abstract:
    We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared with previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics that are specifically tailored toward measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus along with its privacy-oriented annotation guidelines, evaluation scripts, and baseline models are available at: https://github.com/NorskRegnesentral/text-anonymization-benchmark.
  • Other:

    Author as listed in the article: Pilan, Ildiko; Lison, Pierre; Øvrelid, Lilja; Papadopoulou, Anthi; Sanchez, David; Batet, Montserrat
    Department: Computer Engineering and Mathematics (Enginyeria Informàtica i Matemàtiques)
    URV author(s): Batet Sanromà, Montserrat / Sánchez Ruenes, David
    Keywords: Peace, justice and strong institutions
    Subject areas: Linguistics and language; Linguistics; Language and linguistics; Language & linguistics; Philology, linguistics and sociolinguistics; Computer science, interdisciplinary applications; Computer science, artificial intelligence; Computer science applications; Social sciences; Humanities; Computer science; Artificial intelligence; Applied linguistics
    License of use: https://creativecommons.org/licenses/by/3.0/es/
    Author's email address: montserrat.batet@urv.cat; david.sanchez@urv.cat
    Author identifier: 0000-0001-8174-7592; 0000-0001-7275-7887
    Record creation date: 2024-11-23
    Deposited article version: info:eu-repo/semantics/publishedVersion
    License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
    Article reference according to the original source: Computational Linguistics. 48 (4): 1053-1101
    Item reference in APA style: Pilan, Ildiko; Lison, Pierre; Øvrelid, Lilja; Papadopoulou, Anthi; Sanchez, David; Batet, Montserrat (2022). The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization. Computational Linguistics, 48(4), 1053-1101. DOI: 10.1162/coli_a_00458
    Institution: Universitat Rovira i Virgili
    Journal publication year: 2022
    Publication type: Journal Publications
  • Keywords:

    Applied Linguistics; Artificial Intelligence; Computer Science Applications; Computer Science, Artificial Intelligence; Computer Science, Interdisciplinary Applications; Language & Linguistics; Language and Linguistics; Linguistics; Linguistics and Language
    Peace, justice and strong institutions
    Linguistics and language
    Linguistics
    Language and linguistics
    Language & linguistics
    Philology, linguistics and sociolinguistics
    Computer science, interdisciplinary applications
    Computer science, artificial intelligence
    Computer science applications
    Social sciences
    Humanities
    Computer science
    Artificial intelligence
    Applied linguistics