🤖 AI Summary
This work addresses accent normalization: converting non-native speech into native-like pronunciation while preserving speaker identity. Methodologically, it proposes an end-to-end framework built on self-supervised discrete phoneme tokens: wav2vec 2.0 extracts discrete phoneme token sequences from source speech; a dedicated token-level conversion model maps them toward native pronunciation; and two duration-preserving mechanisms are introduced to enhance speech naturalness and timbre fidelity. Waveforms are then reconstructed with high fidelity via flow matching. Crucially, the approach requires no parallel training data and generalizes to unseen non-target accents. Experiments across multiple English accents show substantial improvements over frame-wise baselines, with simultaneous gains in naturalness, accent reduction, and speaker similarity. Token-level phonetic analysis further validates the efficacy of discrete-token modeling for accent normalization.
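To make the duration-preservation idea concrete, here is a minimal sketch of one plausible mechanism: run-length encode the frame-level discrete tokens, convert the deduplicated token sequence, then expand each converted token back to its original run length so total duration is unchanged. This is an illustrative assumption, not the paper's actual method; `convert_tokens` stands in for the learned token-level conversion model, stubbed here with a lookup table, and all names are hypothetical.

```python
from itertools import groupby

def run_length_encode(frames):
    """Collapse repeated frame-level tokens into (token, duration) pairs."""
    return [(tok, len(list(run))) for tok, run in groupby(frames)]

def convert_tokens(units, table):
    """Stand-in for the token-level conversion model: map accented
    units to native-like counterparts (identity if unmapped)."""
    return [table.get(u, u) for u in units]

def duration_preserving_convert(frames, table):
    """Convert tokens while keeping each segment's frame count, so the
    output has exactly the same total duration as the input."""
    segments = run_length_encode(frames)
    converted = convert_tokens([tok for tok, _ in segments], table)
    out = []
    for tok, (_, dur) in zip(converted, segments):
        out.extend([tok] * dur)
    return out

# Toy example: unit 17 (an "accented" token) maps to unit 4.
src = [17, 17, 17, 8, 8, 17, 5]
dst = duration_preserving_convert(src, {17: 4})
print(dst)  # [4, 4, 4, 8, 8, 4, 5] — same length, same segment durations
```

A mechanism like this would suit dubbing-style applications, where the converted speech must stay time-aligned with the source; the real system would additionally resynthesize the waveform from the converted tokens (here, via flow matching).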
📝 Abstract
Accent normalization converts foreign-accented speech into native-like speech while preserving speaker identity. We propose a novel pipeline using self-supervised discrete tokens and non-parallel training data. The system extracts tokens from source speech, converts them through a dedicated model, and synthesizes the output using flow matching. Our method demonstrates superior performance over a frame-to-frame baseline in naturalness, accentedness reduction, and timbre preservation across multiple English accents. Through token-level phonetic analysis, we validate the effectiveness of our token-based approach. We also develop two duration preservation methods, suitable for applications such as dubbing.