🤖 AI Summary
To address the challenge of simultaneously achieving low latency and strong anonymity in real-time voice communication, this paper proposes a streaming voice anonymization framework. Methodologically, it employs a causal waveform encoder with minimal lookahead buffering (<20 ms), integrates a lightweight context-aware Transformer for speech content modeling, generates pseudo-speaker embeddings via GANs, and synthesizes anonymized waveforms directly with an end-to-end neural vocoder. The key contribution is the first realization of content–identity disentangled encoding and high-fidelity waveform reconstruction under ultra-low-latency constraints. Experiments under lazy-informed attacks show an equal error rate (EER) of 49.8% for speaker verification and only 8.7% word error rate (WER) for ASR, substantially outperforming prior methods. The framework achieves strong anonymity, high intelligibility, and real-time performance with an end-to-end latency below 60 ms.
📝 Abstract
We propose DarkStream, a streaming speech synthesis model for real-time speaker anonymization. To improve content encoding under strict latency constraints, DarkStream combines a causal waveform encoder, a short lookahead buffer, and transformer-based contextual layers. To further reduce inference time, the model generates waveforms directly via a neural vocoder, removing intermediate mel-spectrogram conversions. Finally, DarkStream anonymizes speaker identity by injecting a GAN-generated pseudo-speaker embedding into the linguistic features produced by the content encoder. Evaluations show our model achieves strong anonymization, yielding close to 50% speaker verification EER (near-chance performance) under the lazy-informed attack scenario, while maintaining acceptable linguistic intelligibility (WER within 9%). By balancing low latency, robust privacy, and minimal intelligibility degradation, DarkStream provides a practical solution for privacy-preserving real-time speech communication.
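To make the latency trade-off concrete, the sketch below illustrates how a short lookahead buffer delays a causal streaming pipeline. This is a minimal illustration only: the frame hop (10 ms) and buffer depth (2 frames ≈ 20 ms) are assumptions chosen to match the paper's stated <20 ms lookahead budget, not details taken from the DarkStream implementation.

```python
from collections import deque

FRAME_MS = 10          # assumed frame hop (illustrative, not from the paper)
LOOKAHEAD_FRAMES = 2   # 2 x 10 ms = 20 ms, matching the stated <20 ms lookahead budget

class StreamingLookahead:
    """Emit frame t only once frames t+1 .. t+lookahead have arrived,
    so each output frame sees a small amount of right context."""
    def __init__(self, lookahead):
        self.lookahead = lookahead
        self.buf = deque()

    def push(self, frame):
        self.buf.append(frame)
        if len(self.buf) > self.lookahead:
            # Oldest frame now has its full lookahead context: release it.
            return self.buf.popleft()
        return None  # still filling the lookahead buffer

proc = StreamingLookahead(LOOKAHEAD_FRAMES)
outputs = [proc.push(t) for t in range(6)]
# The first LOOKAHEAD_FRAMES pushes yield nothing; that startup gap is
# exactly the algorithmic latency added by the lookahead buffer.
latency_ms = LOOKAHEAD_FRAMES * FRAME_MS  # 20 ms
```

Under these assumptions the buffer contributes 20 ms of algorithmic latency, leaving the remaining budget (out of the reported <60 ms end-to-end) for encoder, transformer, and vocoder compute.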