Utility-preserving sanitization of semantically correlated terms in textual documents

Sánchez, D; Batet, M; Viejo, A

doi:10.1016/j.ins.2014.03.103

Identification data

Identifier: imarina:6387811

Handle: https://hdl.handle.net/20.500.11797/imarina6387811

Authors: Sánchez, D; Batet, M; Viejo, A

Abstract:
Traditionally, redaction has been the method of choice for mitigating the privacy issues that arise when textual documents containing sensitive data are declassified. Redaction removes sensitive words from a document prior to its release, with the undesired side effect of severely reducing the utility of the content. Document sanitization is a more recent alternative that avoids these utility issues by generalizing the sensitive terms instead of eliminating them. Some (semi-)automatic redaction/sanitization schemes can be found in the literature; however, they usually neglect semantic correlations between the terms of the document, even though these correlations may disclose the sanitized or redacted sensitive terms. To tackle this issue, this paper proposes a theoretical framework grounded in information theory, which offers a general model capable of measuring the disclosure risk caused by semantically correlated terms, regardless of whether they are proposed for removal or generalization. The new method focuses specifically on generating sanitized documents that retain as much utility (i.e., semantics) as possible while fulfilling the privacy requirements. An implementation of the method has been evaluated in a practical setting, showing that the new approach improves the utility of the output compared with previous work while retaining a similar level of accuracy. © 2014 Elsevier Inc. All rights reserved.
Other:

Original source link: https://www.sciencedirect.com/science/article/abs/pii/S0020025514004009?via%3Dihub
Item reference in APA style: Sánchez, D; Batet, M; Viejo, A (2014). Utility-preserving sanitization of semantically correlated terms in textual documents. Information Sciences, 279, 77-93. DOI: 10.1016/j.ins.2014.03.103
Article reference according to the original source: Information Sciences. 279, 77-93
Article DOI: 10.1016/j.ins.2014.03.103
Journal publication date: 2014-09-20
Entity: Universitat Rovira i Virgili
Deposited article version: info:eu-repo/semantics/acceptedVersion
Record creation date: 2026-05-09
URV author(s): Batet Sanromà, Montserrat / SANCHEZ CERVELLÓ, DOMINGO JOSÉ / Sánchez Ruenes, David / Viejo Galicia, Luis Alexandre
Department: Enginyeria Informàtica i Matemàtiques
License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
Publication type: Journal Publications
ISSN: 0020-0255
Author according to the article: Sánchez, D; Batet, M; Viejo, A
Link to the license of use: https://creativecommons.org/licenses/by/3.0/es/
Subject areas: Theoretical computer science, Software, Information systems and management, Control and systems engineering, Computer science, information systems, Computer science applications, Social sciences, Computer science, Astronomy / physics, Artificial intelligence
Author email addresses: montserrat.batet@urv.cat, david.sanchez@urv.cat, alexandre.viejo@urv.cat

Keywords:

Semantic knowledge
Information theory
Document sanitization
Document redaction
Data privacy
Artificial Intelligence
Computer Science Applications
Computer Science
Information Systems
Control and Systems Engineering
Information Systems and Management
Software
Theoretical Computer Science
Social sciences
Computer science
Astronomy / physics
Documents:

Main document