Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel Data

📅 2025-07-23
🤖 AI Summary
This work addresses accent normalization: converting non-native speech into native-like pronunciation while preserving speaker identity. It proposes a three-stage pipeline built on self-supervised discrete tokens. A wav2vec 2.0 model extracts a discrete token sequence from the source speech, a dedicated token-level conversion model maps it to native-like tokens, and the output is synthesized using flow matching. Two duration-preserving methods are also introduced for applications such as dubbing. The approach requires no parallel data and generalizes to unseen non-target accents. Experiments across multiple English accents show improvements over a frame-to-frame baseline in naturalness, accentedness reduction, and speaker similarity. Token-level phonetic analysis further supports discrete token modeling for accent normalization.

📝 Abstract
Accent normalization converts foreign-accented speech into native-like speech while preserving speaker identity. We propose a novel pipeline using self-supervised discrete tokens and non-parallel training data. The system extracts tokens from source speech, converts them through a dedicated model, and synthesizes the output using flow matching. Our method demonstrates superior performance over a frame-to-frame baseline in naturalness, accentedness reduction, and timbre preservation across multiple English accents. Through token-level phonetic analysis, we validate the effectiveness of our token-based approach. We also develop two duration preservation methods, suitable for applications such as dubbing.
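
The token-extraction step can be illustrated with a small sketch. The abstract does not say how frames are quantized, so this assumes one common choice: assigning each self-supervised feature frame to its nearest vector in a k-means codebook, then collapsing consecutive repeats into (token, run length) pairs. The names `quantize` and `collapse_repeats`, the toy features, and the two-entry codebook are all hypothetical and not taken from the paper.

```python
import numpy as np

def quantize(features, codebook):
    """Map each frame (row of `features`) to the index of its nearest codebook vector."""
    # Squared Euclidean distance between every frame and every code: (frames, codes).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def collapse_repeats(tokens):
    """Merge consecutive duplicate tokens; return (tokens, run lengths in frames)."""
    tokens = np.asarray(tokens)
    if tokens.size == 0:
        return tokens, np.array([], dtype=int)
    # Positions where the token value changes mark the start of a new run.
    starts = np.concatenate(([0], np.flatnonzero(np.diff(tokens)) + 1))
    lengths = np.diff(np.concatenate((starts, [tokens.size])))
    return tokens[starts], lengths

# Toy stand-ins for self-supervised frame features and a learned codebook (assumed).
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
features = np.array([[0.1, 0.0], [0.2, -0.1], [0.9, 1.1], [1.0, 1.0], [0.0, 0.1]])

frame_tokens = quantize(features, codebook)
unit_tokens, run_lengths = collapse_repeats(frame_tokens)
```

The run lengths sum to the original frame count, which is the kind of bookkeeping a duration-preserving system would need to track; how the paper's two duration preservation methods actually work is not described here.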

Problem

Research questions and friction points this paper is trying to address.

Convert foreign-accented speech to native-like speech
Use self-supervised tokens with non-parallel data
Preserve speaker identity and improve speech naturalness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses self-supervised discrete tokens
Employs non-parallel training data
Synthesizes output via flow matching
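
For readers unfamiliar with flow matching, the sketch below shows the general idea under stated assumptions. The page does not give the paper's formulation, so this uses the common linear interpolation path, where a model is trained to regress the velocity `x1 - x0` at the point `x_t = (1 - t) * x0 + t * x1`, and samples are drawn by integrating the learned velocity field from noise. The function names and the closed-form `oracle_velocity` (standing in for a trained, token-conditioned network) are hypothetical.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Return the interpolated point x_t and the velocity target for a linear path."""
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0  # constant velocity along the straight line from x0 to x1
    return x_t, u_t

def euler_sample(velocity, x0, n_steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays strictly below 1
        x = x + dt * velocity(x, t)
    return x

# Toy target standing in for an acoustic feature frame (assumed values).
target = np.array([1.0, -2.0, 0.5])

def oracle_velocity(x, t):
    """Exact velocity field that carries any start point to `target` at t=1."""
    return (target - x) / (1.0 - t)

noise = np.zeros(3)
sample = euler_sample(oracle_velocity, noise, n_steps=8)
x_half, u_half = cfm_training_pair(noise, target, 0.5)
```

In a real system the oracle would be replaced by a neural network conditioned on the converted tokens and speaker information, trained with a squared-error loss between its output at `x_t` and `u_t`.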