AI Summary
This study addresses a critical yet overlooked issue in speech privacy evaluation: the lexical distinctiveness of read speech in the LibriSpeech dataset can inadvertently leak speaker identity, thereby compromising the assessment of voice anonymization systems. The authors demonstrate for the first time that linguistic content itself serves as a source of identity leakage. Through lexical distribution analysis and speaker identification experiments, they compare LibriSpeech with the more representative EdAcc dataset. Their findings reveal that a majority of speakers in LibriSpeech can be identified based solely on vocabulary usage, whereas this effect is substantially diminished in EdAcc. This result underscores EdAcc's superior suitability for fair and comprehensive evaluation of anonymization systems' privacy-preserving capabilities.
Abstract
Speaker anonymization aims to conceal a speaker's identity, without considering the linguistic content. In this study, we reveal a weakness of LibriSpeech, the dataset that is commonly used to evaluate anonymizers: the books read by the LibriSpeech speakers are so distinct that speakers can be identified by their vocabularies. Even perfect anonymizers cannot prevent this identity leakage. The EdAcc dataset is better in this regard: only a few speakers can be identified through their vocabularies, encouraging the attacker to look elsewhere for the identities of the anonymized speakers. EdAcc also comprises spontaneous speech and more diverse speakers, complementing LibriSpeech and giving more insights into how anonymizers work.
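To make the idea of vocabulary-based identification concrete, the following is a minimal sketch, not the authors' method. It assumes each speaker has some enrollment transcript and attributes a new transcript to the speaker whose word counts are most similar under cosine similarity. The speaker names, the toy sentences, and the helpers `bag_of_words`, `cosine`, and `identify_speaker` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(enrollment, transcript):
    """Return the enrolled speaker whose vocabulary best matches the transcript."""
    query = bag_of_words(transcript)
    return max(enrollment, key=lambda speaker: cosine(enrollment[speaker], query))


# Hypothetical toy data: each speaker "reads" a different book.
enrollment_texts = {
    "speaker_a": "the whale rose beside the ship and the captain shouted to the crew",
    "speaker_b": "the detective examined the letter and questioned the butler about the garden",
}
enrollment = {s: bag_of_words(t) for s, t in enrollment_texts.items()}

guess_1 = identify_speaker(enrollment, "the crew watched the whale from the ship")
guess_2 = identify_speaker(enrollment, "the butler found the letter in the garden")
print(guess_1, guess_2)
```

Note that this attack never touches the audio: it works on transcripts alone, which is why an anonymizer that only changes the voice cannot stop it. A realistic analysis would likely down-weight common words (for example with TF-IDF) and use far longer transcripts than these toy sentences.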