The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization

Pilán, I; Lison, P; Ovrelid, L; Papadopoulou, A; Sánchez, D; Batet, M

doi:10.1162/coli_a_00458

Identifying data

Identifier: imarina:9287297

Handle: https://hdl.handle.net/20.500.11797/imarina9287297

Authors: Pilán, I; Lison, P; Ovrelid, L; Papadopoulou, A; Sánchez, D; Batet, M

Abstract:
We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared with previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics that are specifically tailored toward measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus along with its privacy-oriented annotation guidelines, evaluation scripts, and baseline models are available on: https://github.com/NorskRegnesentral/text-anonymization-benchmark.
Other:

Link to the original source: https://direct.mit.edu/coli/article/48/4/1053/112770/The-Text-Anonymization-Benchmark-TAB-A-Dedicated
Item reference according to APA style: Pilán, I; Lison, P; Ovrelid, L; Papadopoulou, A; Sánchez, D; Batet, M (2022). The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization. COMPUTATIONAL LINGUISTICS, 48(4), 1053-1101. DOI: 10.1162/coli_a_00458
Article reference according to the original source: COMPUTATIONAL LINGUISTICS. 48 (4): 1053-1101
Article DOI: 10.1162/coli_a_00458
Journal publication date: 2022-12-01
Institution: Universitat Rovira i Virgili
Deposited article version: info:eu-repo/semantics/publishedVersion
Record creation date: 2026-05-09
URV author(s): Batet Sanromà, Montserrat / Sánchez Ruenes, David
Department: Enginyeria Informàtica i Matemàtiques
License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
Publication type: Journal Publications
Author(s) as listed in the article: Pilán, I; Lison, P; Ovrelid, L; Papadopoulou, A; Sánchez, D; Batet, M
License of use: https://creativecommons.org/licenses/by/3.0/es/
Subject areas: Linguistics and language, Linguistics, Language and linguistics, Language & linguistics, Filologia, lingüística i sociolingüística, Filología lingüística y sociolingüística, Computer science, interdisciplinary applications, Computer science, artificial intelligence, Computer science applications, Ciencias sociales, Ciencias humanas, Ciência da computação, Astronomia / física, Artificial intelligence, Applied linguistics
Author email addresses: montserrat.batet@urv.cat, david.sanchez@urv.cat

Keywords:

Peace, justice and strong institutions
Applied Linguistics
Artificial Intelligence
Computer Science Applications
Computer Science, Interdisciplinary Applications
Language & Linguistics
Language and Linguistics
Linguistics
Linguistics and Language
Filologia, lingüística i sociolingüística
Filología lingüística y sociolingüística
Ciencias sociales
Ciencias humanas
Ciência da computação
Astronomia / física
Documents:

Main document