Utility-Preserving Privacy Protection of Textual Documents via Word Embeddings

  • Identification data

    Identifier:  imarina:9242167
    Authors:  Hassan, Fadi; Sanchez, David; Domingo-Ferrer, Josep
    Abstract:
    A great variety of mechanisms have been proposed to protect structured databases with numerical and categorical attributes; however, little attention has been devoted to unstructured textual data. Textual data protection requires first detecting sensitive pieces of text and then masking those pieces via suppression or generalization. Current solutions rely on classifiers that can recognize a fixed set of (allegedly sensitive) named entities. Yet such approaches fall short of providing adequate protection because, in reality, references to sensitive information are not limited to a predefined set of entity types, and not all appearances of a given entity type result in disclosure. In this work we propose a more general and flexible solution based on the notion of word embedding. By means of word embeddings we build vectors that numerically capture the semantic relationships of the textual terms. Then we evaluate the disclosure caused by the terms on the entity to be protected according to the similarity between their vector representations. Our method also preserves the semantics (and, therefore, the utility) of the document by replacing risky terms with privacy-preserving generalizations. Empirical results show that our approach offers much more robust protection and greater utility preservation than methods based on named entity recognition.
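
    The general idea described in the abstract (score each term by the similarity of its vector to that of the entity to be protected, then replace risky terms with a generalization) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors, the use of cosine similarity, the 0.8 threshold, the `GENERALIZATIONS` table, and the function names are all assumptions made for illustration. Real embeddings would be learned from a corpus.

```python
import math

# Toy 3-dimensional vectors invented for illustration only; real word
# embeddings would be learned from a large corpus and have many more dimensions.
TOY_EMBEDDINGS = {
    "diabetes": [0.9, 0.1, 0.0],
    "insulin": [0.8, 0.2, 0.1],
    "tuesday": [0.0, 0.1, 0.9],
    "visited": [0.1, 0.9, 0.2],
}

# Hypothetical generalizations; a real system might draw these from a taxonomy.
GENERALIZATIONS = {"insulin": "medication", "diabetes": "disease"}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mask_terms(tokens, protected_entity, threshold=0.8):
    """Replace tokens whose vector is too similar to the protected entity.

    The threshold is an assumed value, not one taken from the paper.
    Tokens without a known vector are left unchanged.
    """
    target = TOY_EMBEDDINGS[protected_entity]
    masked = []
    for token in tokens:
        vector = TOY_EMBEDDINGS.get(token.lower())
        if vector is not None and cosine_similarity(vector, target) >= threshold:
            # Prefer a generalization over plain suppression to keep utility.
            masked.append(GENERALIZATIONS.get(token.lower(), "[REDACTED]"))
        else:
            masked.append(token)
    return masked


tokens = ["Patient", "visited", "Tuesday", "for", "insulin"]
result = mask_terms(tokens, "diabetes")
print(" ".join(result))
```

    In this toy setup only "insulin" is close enough to "diabetes" to be generalized; terms with no vector or with low similarity are kept, which is how such a scheme would aim to preserve the utility of the text.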
  • Others:

    Link to the original source: https://ieeexplore.ieee.org/document/9419784
    APA: Hassan, Fadi; Sanchez, David; Domingo-Ferrer, Josep (2023). Utility-Preserving Privacy Protection of Textual Documents via Word Embeddings. IEEE Transactions on Knowledge and Data Engineering, 35(1), 1058-1071. DOI: 10.1109/TKDE.2021.3076632
    Paper original source: IEEE Transactions on Knowledge and Data Engineering. 35 (1): 1058-1071
    Article's DOI: 10.1109/TKDE.2021.3076632
    Journal publication year: 2023
    Entity: Universitat Rovira i Virgili
    Paper version: info:eu-repo/semantics/acceptedVersion
    Record's date: 2024-10-12
    URV's Author/s: Domingo Ferrer, Josep / Hassan, Fadi Abdulfattah Mohammed / Sánchez Ruenes, David
    Department: Enginyeria Informàtica i Matemàtiques
    Licence document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
    Publication Type: Journal Publications
    Author, as appears in the article: Hassan, Fadi; Sanchez, David; Domingo-Ferrer, Josep
    Licence for use: https://creativecommons.org/licenses/by/3.0/es/
    Thematic Areas: Interdisciplinary; Information systems; Engineering, electrical & electronic; Computer science, information systems; Computer science, artificial intelligence; Computer science applications; Computational theory and mathematics; Computer science
    Author's mail: david.sanchez@urv.cat, josep.domingo@urv.cat
  • Keywords:

    Word embeddings
    Vector representations
    Training data
    Textual documents
    Structured database
    Sensitive information
    Semantics
    Semantic relationships
    Redaction
    Privacy protection
    Privacy preserving
    Privacy by design
    Natural language processing systems
    Named entity recognition
    Manuals
    Hidden Markov models
    Embeddings
    Databases
    Data protection
    Data models
    Categorical attributes
    Computational Theory and Mathematics
    Computer Science Applications
    Computer Science
    Artificial Intelligence
    Information Systems
    Engineering
    Electrical & Electronic
    Interdisciplinary
    Computer science