PHONOS: PHOnetic Neutralization for Online Streaming Applications

📅 2026-03-27
🤖 AI Summary
This work addresses a limitation of existing speaker anonymization systems: by preserving non-native accents, they inadvertently narrow the anonymity set and weaken privacy. To mitigate this, the authors propose a streaming-compatible accent neutralization module that makes non-native speech sound native-like under low-latency constraints. The approach uses silence-aware dynamic time warping alignment and zero-shot voice conversion to pre-generate training targets, which supervise a causal accent translator trained with joint cross-entropy and CTC losses and requiring at most 40 ms of future context. Experimental results show an 81% reduction in non-native accent confidence, listening-test ratings consistent with that shift, and lower speaker linkability, with latency below 241 ms on a single GPU.
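As a rough illustration of what a bounded future context means for a causal model, the sketch below builds an attention-style visibility mask in which each output frame may see all past frames plus a fixed number of future frames. The 20 ms frame hop and both function names are assumptions made for this example only; the source states the 40 ms limit but not the frame rate or the translator's architecture.

```python
from typing import List


def frames_for_lookahead(lookahead_ms: float, frame_hop_ms: float) -> int:
    """Convert a look-ahead budget in milliseconds into whole frames."""
    return int(lookahead_ms // frame_hop_ms)


def lookahead_mask(num_frames: int, lookahead_frames: int) -> List[List[bool]]:
    """mask[t][s] is True when output frame t may use input frame s.

    Frame t sees every past frame, itself, and at most
    `lookahead_frames` future frames.
    """
    return [
        [s <= t + lookahead_frames for s in range(num_frames)]
        for t in range(num_frames)
    ]


# Assumed 20 ms hop (hypothetical): a 40 ms budget allows 2 future frames.
k = frames_for_lookahead(40, 20)
mask = lookahead_mask(5, k)
```

Under the assumed hop, frame 0 can see frames 0 through 2 and nothing later, which is the property that lets such a model emit output while audio is still arriving.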
📝 Abstract
Speaker anonymization (SA) systems modify timbre while leaving regional or non-native accents intact, which is problematic because accents can narrow the anonymity set. To address this issue, we present PHONOS, a streaming module for real-time SA that neutralizes non-native accents so that speech sounds native-like. Our approach pre-generates golden speaker utterances that preserve source timbre and rhythm but replace foreign segmentals with native ones using silence-aware DTW alignment and zero-shot voice conversion. These utterances supervise a causal accent translator that maps non-native content tokens to native equivalents with at most 40 ms look-ahead, trained using joint cross-entropy and CTC losses. Our evaluations show an 81% reduction in non-native accent confidence, with listening-test ratings consistent with this shift, and reduced speaker linkability as accent-neutralized utterances move away from the original speaker in embedding space, all with latency under 241 ms on a single GPU.
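For readers unfamiliar with DTW, the following is a minimal sketch of classic dynamic time warping between two sequences, returning the total cost and the frame-to-frame alignment path. It is plain DTW on scalar features, not the authors' silence-aware variant, whose silence handling the abstract does not describe; the function name and the absolute-difference cost are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def dtw_path(a: Sequence[float], b: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Classic DTW: return (total cost, alignment path of index pairs)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local frame distance (assumed)
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


total, path = dtw_path([0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0])
```

In this toy call the second sequence holds its middle value for one extra frame, so the path maps source frame 1 to both target frames 1 and 2 at zero total cost. A real system would align multi-dimensional acoustic or content features rather than scalars.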
Problem

Research questions and friction points this paper is trying to address.

speaker anonymization
accent neutralization
non-native accent
online streaming
privacy
Innovation

Methods, ideas, or system contributions that make the work stand out.

accent neutralization
streaming speaker anonymization
zero-shot voice conversion
causal accent translation
silence-aware DTW
Waris Quamer
Department of Computer Science & Engineering, Texas A&M University, College Station, US
Mu-Ruei Tseng
Department of Computer Science & Engineering, Texas A&M University, College Station, US
Ghady Nasrallah
Department of Computer Science & Engineering, Texas A&M University, College Station, US
Ricardo Gutierrez-Osuna
Texas A&M University, Computer Science and Engineering
Speech generation, digital health, wearable sensors, machine learning, chemometrics