
Utility-preserving sanitization of semantically correlated terms in textual documents

  • Identifying data

    Identifier:  imarina:6387811
    Authors:  Sanchez, David; Batet, Montserrat; Viejo, Alexandre
    Abstract:
    Traditionally, redaction has been the method chosen to mitigate the privacy issues related to the declassification of textual documents containing sensitive data. This process is based on removing sensitive words in the documents prior to their release and has the undesired side effect of severely reducing the utility of the content. Document sanitization is a recent alternative to redaction, which avoids utility issues by generalizing the sensitive terms instead of eliminating them. Some (semi-)automatic redaction/sanitization schemes can be found in the literature; however, they usually neglect the importance of semantic correlations between the terms of the document, even though these may disclose sanitized/redacted sensitive terms. To tackle this issue, this paper proposes a theoretical framework grounded in information theory, which offers a general model capable of measuring the disclosure risk caused by semantically correlated terms, regardless of whether they are proposed for removal or generalization. The new method specifically focuses on generating sanitized documents that retain as much utility (i.e., semantics) as possible while fulfilling the privacy requirements. The implementation of the method has been evaluated in a practical setting, showing that the new approach improves the output's utility in comparison to the previous work, while retaining a similar level of accuracy. © 2014 Elsevier Inc. All rights reserved.
  • Other:

    Author according to the article: Sanchez, David; Batet, Montserrat; Viejo, Alexandre
    Department: Enginyeria Informàtica i Matemàtiques
    URV author(s): Batet Sanromà, Montserrat / SANCHEZ CERVELLÓ, DOMINGO JOSÉ / Sánchez Ruenes, David / Viejo Galicia, Luis Alexandre
    Keywords: Data privacy; Document redaction; Document sanitization; Information theory; Semantic knowledge
    Subject areas: Public and business administration, accounting and tourism; Artificial intelligence; Astronomy / physics; Biodiversity; Computer science; Agricultural sciences I; Environmental sciences; Biological sciences I; Social sciences; Computer science applications; Computer science, information systems; Communication and information; Control and systems engineering; Engineering I; Engineering III; Engineering IV; Teaching; Information systems and management; Interdisciplinary; Mathematics / probability and statistics; Medicine II; Software; Theoretical computer science
    License of use: https://creativecommons.org/licenses/by/3.0/es/
    Author email address: alexandre.viejo@urv.cat; david.sanchez@urv.cat; montserrat.batet@urv.cat
    ISSN: 00200255
    Record creation date: 2025-02-08
    Deposited article version: info:eu-repo/semantics/acceptedVersion
    Original source link: https://www.sciencedirect.com/science/article/abs/pii/S0020025514004009?via%3Dihub
    Article reference according to the original source: Information Sciences. 279 77-93
    Item reference in APA style: Sanchez, David; Batet, Montserrat; Viejo, Alexandre (2014). Utility-preserving sanitization of semantically correlated terms in textual documents. Information Sciences, 279(), 77-93. DOI: 10.1016/j.ins.2014.03.103
    License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
    Article DOI: 10.1016/j.ins.2014.03.103
    Entity: Universitat Rovira i Virgili
    Journal publication year: 2014
    Publication type: Journal Publications