Minimizing the disclosure risk of semantic correlations in document sanitization

Sánchez, D.; Batet, M.; Viejo, A.

doi:10.1016/j.ins.2013.06.042

Identifying data

Identifier: PC:400

Handle: https://hdl.handle.net/20.500.11797/PC400

Authors: Sánchez, D.; Batet, M.; Viejo, A.

Abstract:
Text sanitization is crucial to enable privacy-preserving declassification of confidential documents. Moreover, considering the advent of new information sharing technologies that enable the daily publication of thousands of textual documents, automatic and semi-automatic sanitization methods are needed. Even though several of these methods have been proposed, most of them detect and sanitize sensitive terms (e.g., people's names, addresses, diseases) independently, neglecting the importance of semantic correlations. From the attacker's perspective, semantic correlations can be exploited to disclose a sanitized term from the presence of one or several non-sanitized words. To tackle this problem, this paper presents a general-purpose method that, by taking the output of a standard sanitization mechanism, analyses, detects and proposes for sanitization those semantically correlated terms that represent a plausible disclosure risk for the already sanitized ones. Our method relies on an information-theoretic formulation of disclosure risk which is able to adapt its behavior to the criterion of the initial sanitizer. The evaluation, carried out over a collection of real documents, shows that semantic correlations represent a real privacy threat in previously sanitized documents, and that our method is able to detect them effectively. As a result, the disclosure risk of the sanitized output is significantly reduced with respect to standard sanitization mechanisms.
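
The abstract does not give the paper's actual risk formulation, so the following is only a minimal sketch of the general idea, under stated assumptions: the correlation between a sanitized term and a remaining term is measured with pointwise mutual information (PMI) estimated from corpus counts, and a remaining term is flagged when its PMI reaches a fraction of the sanitized term's information content. The function names, the `ratio` parameter, and the toy counts are all hypothetical and are not taken from the paper.

```python
import math

def information_content(count, total):
    """IC(t) = -log2 p(t), with p(t) estimated as count / total."""
    return -math.log2(count / total)

def pmi(joint_count, count_a, count_b, total):
    """Pointwise mutual information from raw (co-)occurrence counts."""
    if joint_count == 0:
        return float("-inf")  # terms never co-occur: no evidence of correlation
    return math.log2((joint_count * total) / (count_a * count_b))

def risky_terms(sanitized, remaining, counts, joint_counts, total, ratio=0.5):
    """Return (remaining_term, sanitized_term) pairs whose PMI reaches
    ratio * IC(sanitized_term). The threshold rule is an assumption made
    for illustration, not the criterion used in the paper."""
    flagged = []
    for s in sanitized:
        threshold = ratio * information_content(counts[s], total)
        for t in remaining:
            joint = joint_counts.get(frozenset((s, t)), 0)
            if pmi(joint, counts[s], counts[t], total) >= threshold:
                flagged.append((t, s))
    return flagged

# Toy, made-up counts over a hypothetical corpus of one million documents.
total = 1_000_000
counts = {"HIV": 1_000, "antiretroviral": 500, "hospital": 50_000}
joint_counts = {
    frozenset(("HIV", "antiretroviral")): 400,
    frozenset(("HIV", "hospital")): 100,
}

result = risky_terms(["HIV"], ["antiretroviral", "hospital"],
                     counts, joint_counts, total)
print(result)
```

With these toy counts, "antiretroviral" is strongly correlated with the sanitized term and would be proposed for sanitization, while "hospital" would not. Tying the threshold to the information content of the sanitized term is one simple way to let the check follow the strictness of the initial sanitizer.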
Other:

Original source link: http://www.sciencedirect.com/science/article/pii/S0020025513004799
Article DOI: 10.1016/j.ins.2013.06.042
Journal publication year: 2013
Entity: Universitat Rovira i Virgili.
Deposited article version: info:eu-repo/semantics/submittedVersion
First page: 110
Department: Enginyeria Informàtica i Matemàtiques
License document URL: https://repositori.urv.cat/ca/proteccio-de-dades/
Last page: 123
ISSN: 0020-0255
Author as listed in the article: Sánchez, D., Batet, M., Viejo, A.
Usage license link: https://creativecommons.org/licenses/by/3.0/es/
Journal volume: 249

Keywords:

0020-0255
Documents:

Main document