DarkStream: real-time speech anonymization with low latency

📅 2025-09-04
🤖 AI Summary
This paper proposes a streaming voice anonymization framework for real-time voice communication, where low latency and strong anonymity must be achieved at the same time. The method uses a causal waveform encoder with a short lookahead buffer (<20 ms), adds a lightweight context-aware Transformer for speech content modeling, generates pseudo-speaker embeddings with a GAN, and synthesizes the anonymized waveform directly with an end-to-end neural vocoder. Its key contribution is presented as the first realization of content–identity disentangled encoding and high-fidelity waveform reconstruction under ultra-low-latency constraints. In the lazy-informed attack scenario, the reported speaker verification equal error rate (EER) is 49.8% and the ASR word error rate (WER) is 8.7%, which the summary describes as substantially better than prior methods. The framework is reported to combine strong anonymity, high intelligibility, and real-time operation with an end-to-end latency of <60 ms.

📝 Abstract
We propose DarkStream, a streaming speech synthesis model for real-time speaker anonymization. To improve content encoding under strict latency constraints, DarkStream combines a causal waveform encoder, a short lookahead buffer, and transformer-based contextual layers. To further reduce inference time, the model generates waveforms directly via a neural vocoder, thus removing intermediate mel-spectrogram conversions. Finally, DarkStream anonymizes speaker identity by injecting a GAN-generated pseudo-speaker embedding into linguistic features from the content encoder. Evaluations show our model achieves strong anonymization, yielding close to 50% speaker verification EER (near-chance performance) in the lazy-informed attack scenario, while maintaining acceptable linguistic intelligibility (WER within 9%). By balancing low latency, robust privacy, and minimal intelligibility degradation, DarkStream provides a practical solution for privacy-preserving real-time speech communication.
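To make the "close to 50% EER (near-chance performance)" claim concrete, the following is a minimal sketch of how an equal error rate can be computed from verification scores. It is not the paper's evaluation code: the function name `equal_error_rate`, the threshold sweep, and the toy scores are assumptions chosen for illustration. The EER is the operating point where the false-acceptance rate equals the false-rejection rate, so an attacker whose scores cannot separate same-speaker from different-speaker trials ends up near 50%.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    target_scores: scores for same-speaker trials (should be accepted).
    nontarget_scores: scores for different-speaker trials (should be rejected).
    A trial is accepted when its score is >= the threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        # False acceptance: a different-speaker trial scored at or above th.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: a same-speaker trial scored below th.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): the verifier separates speakers perfectly.
separated = equal_error_rate([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])
# Toy scores (hypothetical): both trial types have the same score distribution,
# which is what a successful anonymizer aims for.
chance = equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4])
print(separated, chance)  # prints 0.0 0.5
```

In an anonymization setting a higher EER is better for privacy, since it means the attacker's verifier does no better than guessing.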
Problem

Research questions and friction points this paper is trying to address.

Real-time speaker anonymization with low latency
Improving content encoding under strict latency constraints
Anonymizing speaker identity while maintaining intelligibility
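Intelligibility in this setting is measured by the word error rate of an ASR system run on the anonymized speech. As a rough illustration only, the sketch below computes WER as word-level edit distance divided by the reference length. The function name `word_error_rate` and the example sentences are assumptions, not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcript pair: one substituted word out of six.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(round(wer, 3))  # prints 0.167
```

A WER within 9%, as reported in the abstract, corresponds to fewer than one word error per eleven reference words on average.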
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines causal encoder with lookahead buffer for low latency
Generates waveforms directly via neural vocoder for efficiency
Injects GAN-generated pseudo-speaker embedding for anonymization
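The ideas listed above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not DarkStream's architecture: the function names, the 16 kHz sample rate, the 320-sample lookahead, and the use of concatenation to inject the speaker embedding are all hypothetical, since the source does not specify them. The sketch shows a 1-D convolution whose output at time t sees only a bounded number of future samples, the algorithmic delay that lookahead implies, and one simple way to condition content frames on a pseudo-speaker embedding.

```python
def causal_conv_with_lookahead(signal, kernel, lookahead):
    """1-D convolution where output[t] depends on signal[t - past .. t + lookahead].

    kernel[0] weights the oldest sample and kernel[-1] the newest one.
    Samples outside the signal are treated as zero.
    """
    past = len(kernel) - 1 - lookahead
    if lookahead < 0 or past < 0:
        raise ValueError("lookahead must be between 0 and len(kernel) - 1")

    output = []
    for t in range(len(signal)):
        acc = 0.0
        for i, weight in enumerate(kernel):
            idx = t - past + i
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        output.append(acc)
    return output


def lookahead_latency_ms(lookahead_samples, sample_rate_hz):
    """Delay introduced by waiting for future samples, in milliseconds."""
    return 1000 * lookahead_samples / sample_rate_hz


def inject_speaker_embedding(content_frames, speaker_embedding):
    """Assumed injection scheme: append the same embedding to every frame."""
    return [list(frame) + list(speaker_embedding) for frame in content_frames]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = causal_conv_with_lookahead(x, kernel=[1.0, 1.0, 1.0], lookahead=1)

# Changing a sample beyond t + lookahead must not affect output[t].
x_future = x[:4] + [100.0] + x[5:]
y_future = causal_conv_with_lookahead(x_future, kernel=[1.0, 1.0, 1.0], lookahead=1)
print(y[2] == y_future[2], y[3] == y_future[3])  # prints True False

print(lookahead_latency_ms(320, 16000))  # prints 20.0
print(inject_speaker_embedding([[0.1, 0.2], [0.3, 0.4]], [9.0]))
```

Setting `lookahead=0` gives a strictly causal filter. Each extra future sample improves context at the cost of added delay, which is the trade-off a short lookahead buffer is meant to manage.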