DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement

📅 2026-03-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the degradation in automatic speech recognition (ASR) performance on dysarthric speech, which exhibits abnormal prosody and pronounced speaker variability. The authors propose DARS, a dysarthria-aware rhythm-style speech synthesis framework built on the Matcha-TTS architecture. It combines a multi-stage rhythm predictor, optimized with contrastive preferences between normal and dysarthric speech, with a dysarthric-style conditional flow matching mechanism, so that temporal rhythm and pathological acoustic style are modeled jointly. On the TORGO dataset, the synthesized speech reaches a Mel-cepstral distortion (MCD) of 4.29, and adapting a Whisper-based ASR system with the synthetic speech yields a 54.22% relative reduction in word error rate (WER) compared to state-of-the-art methods.
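
For orientation, the conditional flow matching objective that Matcha-TTS trains with can be sketched as follows. This summary does not say how DARS injects its dysarthric-style condition, so the extra conditioning variable s in the vector field is an assumption made purely for illustration.

```latex
\mathcal{L}(\theta) =
  \mathbb{E}_{t,\, q(x_1),\, p_0(x_0)}
  \left\| u_t\big(\phi_t(x_0) \mid x_1\big)
        - v_t\big(\phi_t(x_0) \mid \mu, s;\, \theta\big) \right\|^2,
\qquad
\phi_t(x_0) = \big(1 - (1 - \sigma_{\min})\, t\big)\, x_0 + t\, x_1,
\qquad
u_t\big(\phi_t(x_0) \mid x_1\big) = x_1 - (1 - \sigma_{\min})\, x_0
```

Here x_1 is a target mel-spectrogram, x_0 is Gaussian noise, t is drawn uniformly from [0, 1], \mu is the text-encoder prediction, v_t is the learned vector field, and \sigma_{\min} is a small constant. The model is trained to regress the straight-line velocity from noise to data.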

📝 Abstract

Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pathological rhythm and acoustic style of dysarthric speech. To address this, we propose DARS, a dysarthria-aware rhythm-style synthesis framework based on the Matcha-TTS architecture. DARS incorporates a multi-stage rhythm predictor optimized by contrastive preferences between normal and dysarthric speech, along with a dysarthric-style conditional flow matching mechanism, jointly enhancing temporal rhythm reconstruction and pathological acoustic style simulation. Experiments on the TORGO dataset demonstrate that DARS achieves a Mel-Cepstral Distortion (MCD) of 4.29, closely approximating real dysarthric speech. Adapting a Whisper-based ASR system with synthetic dysarthric speech from DARS achieves a 54.22% relative reduction in word error rate (WER) compared to state-of-the-art methods, demonstrating the framework's effectiveness in enhancing recognition performance.
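
The two metrics quoted in the abstract can be made concrete with a minimal sketch. It assumes the commonly used MCD definition in dB over frames that are already time-aligned, with the 0th (energy) coefficient skipped; the paper's exact alignment and feature settings are not given here, and the function names are hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Average MCD in dB over time-aligned mel-cepstral frames.

    Assumption: both inputs are equally long sequences of frames, and
    coefficient 0 (overall energy) is excluded from the distance.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("expected two non-empty, equally long frame sequences")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Squared Euclidean distance over coefficients 1..D of this frame.
        sq_dist = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * sq_dist)
    return total / len(ref_frames)


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Toy numbers for illustration only, not values from the paper.
    print(mel_cepstral_distortion([[0.0, 1.0, 2.0]], [[0.0, 1.5, 2.0]]))
    print(relative_wer_reduction(40.0, 20.0))
```

A relative reduction is measured against the comparison system's WER, so a 54.22% relative reduction means the adapted system makes a little under half as many word errors as the system it is compared with. It is not a drop of 54.22 absolute percentage points.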

Problem

Research questions and friction points this paper is trying to address.

Dysarthria

Automatic Speech Recognition

Prosody

Speaker Variability

Pathological Speech

Innovation

Methods, ideas, or system contributions that make the work stand out.

dysarthria-aware synthesis

rhythm modeling

style conditional flow matching

TTS-based data augmentation

ASR enhancement
