Content Leakage in LibriSpeech and Its Impact on the Privacy Evaluation of Speaker Anonymization

📅 2026-01-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses an overlooked issue in speech privacy evaluation: the read speech in the LibriSpeech dataset is lexically so distinctive that it can leak speaker identity, which compromises the assessment of voice anonymization systems. The authors demonstrate for the first time that linguistic content itself serves as a source of identity leakage. Through lexical distribution analysis and speaker identification experiments, they compare LibriSpeech with the EdAcc dataset, which comprises spontaneous speech from more diverse speakers. Their findings reveal that a majority of speakers in LibriSpeech can be identified based solely on vocabulary usage, whereas this effect is substantially diminished in EdAcc. This result suggests that EdAcc is better suited in this regard to a fair evaluation of the privacy protection offered by anonymization systems.

📝 Abstract

Speaker anonymization aims to conceal a speaker's identity, without considering the linguistic content. In this study, we reveal a weakness of LibriSpeech, the dataset that is commonly used to evaluate anonymizers: the books read by the LibriSpeech speakers are so distinct that speakers can be identified by their vocabularies. Even perfect anonymizers cannot prevent this identity leakage. The EdAcc dataset is better in this regard: only a few speakers can be identified through their vocabularies, encouraging the attacker to look elsewhere for the identities of the anonymized speakers. EdAcc also comprises spontaneous speech and more diverse speakers, complementing LibriSpeech and giving more insight into how anonymizers work.
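
To make the idea of vocabulary-based identification concrete, here is a minimal sketch of how an attacker might match a transcript to a speaker using word statistics alone. It is not the authors' method, which this page does not describe. It assumes a TF-IDF bag-of-words representation with cosine similarity, and the function names (`tfidf_vectors`, `cosine`, `identify`) and the toy enrollment texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (per-speaker TF-IDF vectors, IDF table) for a dict speaker -> text."""
    counts = {spk: Counter(text.lower().split()) for spk, text in docs.items()}
    n_docs = len(counts)
    # Document frequency: in how many speakers' texts each word appears.
    doc_freq = Counter(word for c in counts.values() for word in c)
    # Smoothed IDF (an assumption of this sketch): rarer words weigh more.
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    vectors = {spk: {w: tf * idf[w] for w, tf in c.items()} for spk, c in counts.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(transcript, enrolled, idf):
    """Return the enrolled speaker whose vocabulary best matches the transcript."""
    counts = Counter(transcript.lower().split())
    # Words never seen during enrollment get zero weight.
    query = {w: tf * idf.get(w, 0.0) for w, tf in counts.items()}
    return max(enrolled, key=lambda spk: cosine(query, enrolled[spk]))


# Toy, made-up enrollment texts standing in for two readers of different books.
enrollment = {
    "reader_a": "the whale surfaced near the ship and the captain shouted",
    "reader_b": "the detective examined the letter and the curious footprint",
}
vectors, idf = tfidf_vectors(enrollment)
guess = identify("the captain watched the whale", vectors, idf)
```

Because the attack only reads the words, it is unaffected by any change to the voice. This is why, as the abstract notes, even a perfect anonymizer cannot prevent this kind of leakage when each speaker's text is distinctive.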

Problem

Research questions and friction points this paper is trying to address.

content leakage

speaker anonymization

LibriSpeech

privacy evaluation

vocabulary-based identification

Innovation

Methods, ideas, or system contributions that make the work stand out.

content leakage

speaker anonymization

LibriSpeech