🤖 AI Summary
Gender conversion in speech risks leaking speaker privacy, particularly in reference-free settings where gender-specific acoustic cues persist in the output. To address this, we propose a reference-free adversarial gender obfuscation framework. A gender-conditioned adversarial learning architecture disentangles phonetic content from gender-related representations, while explicit regularization aligns fundamental frequency distributions and formant trajectories with gender-neutral characteristics learned from balanced training data. Crucially, the method removes gender cues without requiring target-speaker references, while preserving speech intelligibility and naturalness. Experiments under a semi-informed attack model show that the approach significantly outperforms a competing obfuscation method, reducing gender identification accuracy by over 40% while achieving a Mean Opinion Score (MOS) of 4.1 for speech quality. The work thus offers a strong trade-off between privacy protection and high-fidelity speech reconstruction.
📝 Abstract
Sex conversion in speech involves privacy risks from data collection and often leaves residual sex-specific cues in outputs, even when target speaker references are unavailable. We introduce RASO (Reference-free Adversarial Sex Obfuscation). Its innovations are a sex-conditional adversarial learning framework that disentangles linguistic content from sex-related acoustic markers, and explicit regularisation that aligns fundamental frequency distributions and formant trajectories with sex-neutral characteristics learned from sex-balanced training data. RASO preserves linguistic content and, even when assessed under a semi-informed attack model, significantly outperforms a competing approach to sex obfuscation.
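
The abstract does not give RASO's loss functions, so the following is only a minimal sketch of how an adversarial obfuscation objective with a fundamental-frequency regulariser could be combined. It is not the authors' implementation. All function names, the weights `adv_weight` and `f0_weight`, the moment-matching form of the F0 penalty, and the use of NaN for unvoiced frames are assumptions made for illustration.

```python
import numpy as np


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy of a sex classifier's predicted probabilities."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs)))


def f0_alignment_penalty(log_f0, neutral_mean, neutral_std):
    """Hypothetical regulariser: squared distance between the mean and standard
    deviation of the voiced log-F0 frames and sex-neutral target statistics.
    Unvoiced frames are assumed to be encoded as NaN and are ignored."""
    voiced = log_f0[np.isfinite(log_f0)]
    return float((voiced.mean() - neutral_mean) ** 2 + (voiced.std() - neutral_std) ** 2)


def obfuscation_objective(content_loss, sex_probs, sex_labels, log_f0,
                          neutral_mean, neutral_std,
                          adv_weight=1.0, f0_weight=0.1):
    """Objective minimised by the converter in this sketch.

    The classifier loss enters with a minus sign, so the converter is rewarded
    when the adversary struggles to recover the speaker's sex, while the F0
    term pulls the output pitch statistics towards the neutral target.
    """
    adversary_loss = binary_cross_entropy(sex_probs, sex_labels)
    f0_penalty = f0_alignment_penalty(log_f0, neutral_mean, neutral_std)
    return content_loss - adv_weight * adversary_loss + f0_weight * f0_penalty
```

In this simplified form, an adversary that predicts sex confidently and correctly yields a higher converter objective than one reduced to chance-level guesses, which is the behaviour the adversarial term is meant to encourage. A formant-trajectory term could be added in the same way as the F0 penalty; it is omitted here because the abstract gives no detail on its form.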