Automatic general-purpose sanitization of textual documents

  • Identifying data

    Identifier:  imarina:6387373
    Authors:  Sanchez, David; Batet, Montserrat; Viejo, Alexandre
    Abstract:
    The advent of new information sharing technologies has led society to a scenario where thousands of textual documents are publicly published every day. The existence of confidential information in many of these documents motivates the use of measures to hide sensitive data before being published, which is precisely the goal of document sanitization. Even though methods to assist the sanitization process have been proposed, most of them are focused on the detection of specific types of sensitive entities for concrete domains, lacking generality and requiring user supervision. Moreover, to hide sensitive terms, most approaches opt to remove them, a measure that hampers the utility of the sanitized document. This paper presents a general-purpose sanitization method that, based on information theory and exploiting knowledge bases, detects and hides sensitive textual information while preserving its meaning. Our proposal works in an automatic and unsupervised way and it can be applied to heterogeneous documents, which makes it especially suitable for environments with massive and heterogeneous information-sharing needs. Evaluation results show that our method outperforms strategies based on trained classifiers regarding the detection recall, whereas it better retains the document's utility compared to term-suppression methods. © 2005-2012 IEEE.
  • Other:

    Author(s) as listed in the article: Sanchez, David; Batet, Montserrat; Viejo, Alexandre
    Department: Enginyeria Informàtica i Matemàtiques (Computer Engineering and Mathematics)
    URV author(s): Batet Sanromà, Montserrat / SANCHEZ CERVELLÓ, DOMINGO JOSÉ / Sánchez Ruenes, David / Viejo Galicia, Luis Alexandre
    Keywords: Data publishing; Document sanitization; Information theory; Privacy
    Subject areas: Ciência da computação; Computer networks and communications; Computer science, theory & methods; Engenharias iii; Engenharias iv; Engineering, electrical & electronic; Interdisciplinar; Safety, risk, reliability and quality
    Usage license: https://creativecommons.org/licenses/by/3.0/es/
    Author email addresses: alexandre.viejo@urv.cat; david.sanchez@urv.cat; montserrat.batet@urv.cat
    ISSN: 15566013
    Record creation date: 2025-02-08
    Deposited article version: info:eu-repo/semantics/acceptedVersion
    Original source link: https://ieeexplore.ieee.org/document/6410029
    Article reference according to the original source: IEEE Transactions on Information Forensics and Security. 8 (6): 853-862
    Item reference in APA style: Sanchez, David; Batet, Montserrat; Viejo, Alexandre (2013). Automatic general-purpose sanitization of textual documents. IEEE Transactions on Information Forensics and Security, 8(6), 853-862. DOI: 10.1109/TIFS.2013.2239641
    License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
    Article DOI: 10.1109/TIFS.2013.2239641
    Institution: Universitat Rovira i Virgili
    Journal publication year: 2013
    Publication type: Journal Publications
  • Keywords:

    Computer Networks and Communications,Computer Science, Theory & Methods,Engineering, Electrical & Electronic,Safety, Risk, Reliability and Quality
    Data publishing
    Document sanitization
    Information theory
    Privacy
    Ciência da computação
    Computer networks and communications
    Computer science, theory & methods
    Engenharias iii
    Engenharias iv
    Engineering, electrical & electronic
    Interdisciplinar
    Safety, risk, reliability and quality
    15566013