Author as listed in the article: Hassan, Fadi; Sanchez, David; Domingo-Ferrer, Josep
Department: Enginyeria Informàtica i Matemàtiques
URV author(s): Domingo Ferrer, Josep / Hassan, Fadi Abdulfattah Mohammed / Sánchez Ruenes, David
Keywords: Word embeddings; Vector representations; Training data; Textual documents; Structured database; Sensitive information; Semantics; Semantic relationships; Redaction; Privacy protection; Privacy preserving; Privacy by design; Natural language processing systems; Named entity recognition; Manuals; Hidden Markov models; Embeddings; Databases; Data protection; Data models; Categorical attributes
Abstract: A great variety of mechanisms have been proposed to protect structured databases with numerical and categorical attributes; however, little attention has been devoted to unstructured textual data. Textual data protection requires first detecting sensitive pieces of text and then masking those pieces via suppression or generalization. Current solutions rely on classifiers that can recognize a fixed set of (allegedly sensitive) named entities. Yet such approaches fall short of providing adequate protection because, in reality, references to sensitive information are not limited to a predefined set of entity types, and not all appearances of a given entity type result in disclosure. In this work we propose a more general and flexible solution based on the notion of word embedding. By means of word embeddings we build vectors that numerically capture the semantic relationships of the textual terms. Then we evaluate the disclosure caused by the terms on the entity to be protected according to the similarity between their vector representations. Our method also preserves the semantics (and, therefore, the utility) of the document by replacing risky terms with privacy-preserving generalizations. Empirical results show that our approach offers much more robust protection and greater utility preservation than methods based on named entity recognition.
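The detection-and-masking idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors, the 0.8 threshold, the generalization table, and the names `TOY_EMBEDDINGS`, `GENERALIZATIONS`, `cosine_similarity`, and `protect` are all hypothetical, and cosine similarity is assumed as the similarity measure. A real system would load trained word embeddings and draw generalizations from a knowledge base.

```python
import math

# Toy 3-dimensional vectors standing in for trained word embeddings.
# The values are invented for illustration only.
TOY_EMBEDDINGS = {
    "diabetes": [0.9, 0.1, 0.0],
    "insulin": [0.8, 0.2, 0.1],
    "glucose": [0.7, 0.3, 0.0],
    "tuesday": [0.0, 0.1, 0.9],
    "walked": [0.1, 0.0, 0.8],
}

# Hypothetical generalizations for risky terms.
GENERALIZATIONS = {"insulin": "medication", "glucose": "substance"}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def protect(tokens, entity, threshold=0.8):
    """Mask tokens whose embedding is close to the protected entity's.

    A risky token is replaced by its generalization when one is known,
    and suppressed otherwise. Tokens without an embedding are kept.
    """
    entity_vec = TOY_EMBEDDINGS[entity]
    protected = []
    for token in tokens:
        vec = TOY_EMBEDDINGS.get(token.lower())
        if vec is not None and cosine_similarity(vec, entity_vec) >= threshold:
            protected.append(GENERALIZATIONS.get(token.lower(), "[REDACTED]"))
        else:
            protected.append(token)
    return protected


if __name__ == "__main__":
    sentence = ["She", "took", "insulin", "on", "Tuesday"]
    print(" ".join(protect(sentence, "diabetes")))
```

In this toy setting, "insulin" lies close to the protected entity "diabetes" and is generalized, while "Tuesday" is distant and left untouched; the threshold controls the trade-off between protection and utility.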
Subject areas: Interdisciplinary; Information systems; Engineering, electrical & electronic; Computer science, information systems; Computer science, artificial intelligence; Computer science applications; Computational theory and mathematics; Computer science
Link to the usage license: https://creativecommons.org/licenses/by/3.0/es/
Author's e-mail address: david.sanchez@urv.cat; josep.domingo@urv.cat
Author identifier: 0000-0001-7275-7887; 0000-0001-7213-4962
Record creation date: 2024-10-12
Version of the deposited article: info:eu-repo/semantics/acceptedVersion
Link to the original source: https://ieeexplore.ieee.org/document/9419784
License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
Article reference according to the original source: IEEE Transactions on Knowledge and Data Engineering. 35 (1): 1058-1071
Item reference in APA style: Hassan, Fadi; Sanchez, David; Domingo-Ferrer, Josep (2023). Utility-Preserving Privacy Protection of Textual Documents via Word Embeddings. IEEE Transactions on Knowledge and Data Engineering, 35(1), 1058-1071. DOI: 10.1109/TKDE.2021.3076632
Article DOI: 10.1109/TKDE.2021.3076632
Entity: Universitat Rovira i Virgili
Journal publication year: 2023
Publication type: Journal Publications