🤖 AI Summary
This work addresses a limitation of existing speaker anonymization systems: they modify timbre but leave regional or non-native accents intact, which can narrow the anonymity set and weaken privacy. To mitigate this, the authors propose PHONOS, a streaming module for real-time speaker anonymization that neutralizes non-native accents so that speech sounds native-like. Silence-aware dynamic time warping (DTW) alignment and zero-shot voice conversion are used to pre-generate golden speaker utterances that keep the source timbre and rhythm but replace foreign segmentals with native ones. These utterances supervise a causal accent translator, trained with joint cross-entropy and CTC losses, that needs at most 40 ms of future context. The reported results are an 81% reduction in non-native accent confidence, listening-test ratings consistent with that shift, reduced speaker linkability, and latency under 241 ms on a single GPU.
📝 Abstract
Speaker anonymization (SA) systems modify timbre while leaving regional or non-native accents intact, which is problematic because accents can narrow the anonymity set. To address this issue, we present PHONOS, a streaming module for real-time SA that neutralizes non-native accents so that speech sounds native-like. Our approach pre-generates golden speaker utterances that preserve source timbre and rhythm but replace foreign segmentals with native ones, using silence-aware DTW alignment and zero-shot voice conversion. These utterances supervise a causal accent translator that maps non-native content tokens to native equivalents with at most 40 ms of look-ahead, trained using joint cross-entropy and CTC losses. Our evaluations show an 81% reduction in non-native accent confidence, with listening-test ratings consistent with this shift. Speaker linkability is also reduced, as accent-neutralized utterances move away from the original speaker in embedding space. Latency stays under 241 ms on a single GPU.
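The abstract does not say how the silence-aware DTW alignment is implemented. As a rough illustration only, the sketch below assumes that "silence-aware" means silent frames are excluded before a classic DTW pass and that the resulting path is mapped back to the original frame indices. The function names `dtw_path` and `silence_aware_align`, the scalar frame features, and the precomputed silence flags are all hypothetical and are not taken from PHONOS.

```python
from math import inf


def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Classic DTW: return the minimal-cost alignment as (i, j) index pairs."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (cost[i - 1][j], i - 1, j),          # advance in a only
            (cost[i][j - 1], i, j - 1),          # advance in b only
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return path


def silence_aware_align(src, tgt, src_silent, tgt_silent):
    """Align only non-silent frames (an assumed reading of 'silence-aware').

    src, tgt: per-frame features (scalars here; real features would be vectors).
    src_silent, tgt_silent: per-frame booleans, True where the frame is silence.
    Returns (src_frame, tgt_frame) pairs in original frame indices.
    """
    src_idx = [i for i, silent in enumerate(src_silent) if not silent]
    tgt_idx = [j for j, silent in enumerate(tgt_silent) if not silent]
    path = dtw_path([src[i] for i in src_idx], [tgt[j] for j in tgt_idx])
    return [(src_idx[i], tgt_idx[j]) for i, j in path]
```

With vector features, `dist` would be replaced by a frame-level distance such as Euclidean or cosine distance. Excluding pauses before alignment keeps them from being warped against speech frames, which is one plausible motivation for making the alignment silence-aware.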