View simple record

dc.contributor.author: Manzanares-Salor, Benet
dc.contributor.author: Sánchez, David
dc.contributor.author: Lison, Pierre
dc.date.accessioned: 2023-03-03T16:02:30Z
dc.date.available: 2023-03-03T16:02:30Z
dc.date.created: 2022-10-11T13:49:37Z
dc.date.issued: 2022
dc.identifier.isbn: 978-3-031-13944-4
dc.identifier.uri: https://hdl.handle.net/11250/3055870
dc.description.abstract: The standard approach to evaluate text anonymization methods consists of comparing their outcomes with the anonymization performed by human experts. The degree of privacy protection attained is then measured with the IR-based recall metric, which expresses the proportion of re-identifying terms that were correctly detected by the anonymization method. However, the use of recall to estimate the degree of privacy protection suffers from several limitations. The first is that it assigns a uniform weight to each re-identifying term, thereby ignoring the fact that some missed re-identifying terms may have a larger influence on the disclosure risk than others. Furthermore, IR-based metrics assume the existence of a single gold standard annotation. This assumption does not hold for text anonymization, where several maskings (each one encompassing a different combination of terms) could be equally valid to prevent disclosure. Finally, those metrics rely on manually anonymized datasets, which are inherently subjective and may be prone to various errors, omissions and inconsistencies. To tackle these issues, we propose an automatic re-identification attack for (anonymized) texts that provides a realistic assessment of disclosure risks. Our method follows a similar premise as the well-known record linkage methods employed to evaluate anonymized structured data, and leverages state-of-the-art deep learning language models to exploit the background knowledge available to potential attackers. We also report empirical evaluations of several well-known methods and tools for text anonymization. Results show significant re-identification risks for all methods, including manual anonymization efforts. [en_US]
dc.description.abstract: Automatic Evaluation of Disclosure Risks of Text Anonymization Methods [en_US]
dc.language.iso: eng [en_US]
dc.relation.ispartof: Privacy in Statistical Databases
dc.rights: Attribution-NonCommercial-ShareAlike 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/deed.no
dc.title: Automatic Evaluation of Disclosure Risks of Text Anonymization Methods [en_US]
dc.title.alternative: Automatic Evaluation of Disclosure Risks of Text Anonymization Methods [en_US]
dc.type: Chapter [en_US]
dc.description.version: acceptedVersion [en_US]
cristin.ispublished: true
cristin.fulltext: postprint
cristin.qualitycode: 1
dc.identifier.cristin: 2060503
dc.relation.project: Norges forskningsråd: 308904 [en_US]


Associated file(s)


This record appears in the following collection(s)


Attribution-NonCommercial-ShareAlike 4.0 International
Except where otherwise noted, this record's license is described as Attribution-NonCommercial-ShareAlike 4.0 International