🤖 AI Summary
This work addresses accent normalization: converting non-native speech into native-like pronunciation while preserving speaker identity. Methodologically, it proposes an end-to-end framework built on self-supervised discrete phoneme tokens: wav2vec 2.0 extracts discrete phoneme token sequences from source speech; a dedicated token-level conversion model maps them toward native pronunciation; and two duration-preserving mechanisms are introduced to enhance speech naturalness and timbre fidelity. Waveforms are then reconstructed with high fidelity via flow matching. Crucially, the approach requires no parallel training data and generalizes to unseen non-target accents. Experiments across multiple English accents show substantial improvements over frame-wise baselines, with simultaneous gains in naturalness, accent reduction, and speaker similarity. Token-level phonetic analysis further validates the efficacy of discrete-token modeling for accent normalization.
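To make the duration-preservation idea concrete, here is a minimal sketch of one plausible mechanism: run-length encode the frame-level discrete tokens, convert the deduplicated token sequence, then expand each converted token back to its original run length so total duration is unchanged. This is an illustrative assumption, not the paper's actual method; `convert_tokens` stands in for the learned token-level conversion model, stubbed here with a lookup table, and all names are hypothetical.

```python
from itertools import groupby

def run_length_encode(frames):
    """Collapse repeated frame-level tokens into (token, duration) pairs."""
    return [(tok, len(list(run))) for tok, run in groupby(frames)]

def convert_tokens(units, table):
    """Stand-in for the token-level conversion model: map accented
    units to native-like counterparts (identity if unmapped)."""
    return [table.get(u, u) for u in units]

def duration_preserving_convert(frames, table):
    """Convert tokens while keeping each segment's frame count, so the
    output has exactly the same total duration as the input."""
    segments = run_length_encode(frames)
    converted = convert_tokens([tok for tok, _ in segments], table)
    out = []
    for tok, (_, dur) in zip(converted, segments):
        out.extend([tok] * dur)
    return out

# Toy example: unit 17 (an "accented" token) maps to unit 4.
src = [17, 17, 17, 8, 8, 17, 5]
dst = duration_preserving_convert(src, {17: 4})
print(dst)  # [4, 4, 4, 8, 8, 4, 5] — same length, same segment durations
```

A mechanism like this would suit dubbing-style applications, where the converted speech must stay time-aligned with the source; the real system would additionally resynthesize the waveform from the converted tokens (here, via flow matching).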
📝 Abstract
Accent normalization converts foreign-accented speech into native-like speech while preserving speaker identity. We propose a novel pipeline using self-supervised discrete tokens and non-parallel training data. The system extracts tokens from source speech, converts them through a dedicated model, and synthesizes the output using flow matching. Our method demonstrates superior performance over a frame-to-frame baseline in naturalness, accentedness reduction, and timbre preservation across multiple English accents. Through token-level phonetic analysis, we validate the effectiveness of our token-based approach. We also develop two duration preservation methods, suitable for applications such as dubbing.