AI Summary
This work addresses the significant degradation in automatic speech recognition (ASR) performance caused by dysarthric speech, which exhibits abnormal prosody and pronounced speaker variability. To mitigate this, the authors propose DARS, a dysarthria-aware rhythm-style speech synthesis framework built upon the Matcha-TTS architecture. The framework combines a multi-stage rhythm predictor, optimized with contrastive preferences between normal and dysarthric speech, with a dysarthric-style conditional flow-matching mechanism, so that temporal rhythm and pathological acoustic style are modeled jointly. Evaluated on the TORGO dataset, the synthesized speech achieves a Mel-cepstral distortion (MCD) of 4.29. Adapting a Whisper-based ASR system with the synthesized dysarthric speech yields a 54.22% relative reduction in word error rate (WER) compared to state-of-the-art methods, substantially improving recognition of dysarthric speech.
Abstract
Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pathological rhythm and acoustic style of dysarthric speech. To address this, we propose DARS, a dysarthria-aware rhythm-style synthesis framework based on the Matcha-TTS architecture. DARS incorporates a multi-stage rhythm predictor optimized by contrastive preferences between normal and dysarthric speech, along with a dysarthric-style conditional flow matching mechanism, jointly enhancing temporal rhythm reconstruction and pathological acoustic style simulation. Experiments on the TORGO dataset demonstrate that DARS achieves a Mel-Cepstral Distortion (MCD) of 4.29, closely approximating real dysarthric speech. Adapting a Whisper-based ASR system with synthetic dysarthric speech from DARS achieves a 54.22% relative reduction in word error rate (WER) compared to state-of-the-art methods, demonstrating the framework's effectiveness in enhancing recognition performance.
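For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how MCD and a relative WER reduction are commonly computed. It is not the authors' evaluation code. The function names are hypothetical, and the sketch assumes the mel-cepstral frames are already time-aligned (for example by dynamic time warping) and that the caller has dropped the 0th (energy) coefficient, as is conventional. The numeric inputs are illustrative placeholders, not values from the paper.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned mel-cepstral frames.

    Each argument is a list of frames; each frame is a list of
    mel-cepstral coefficients (assumed: 0th coefficient excluded).
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("frame sequences must be non-empty and equal in length")

    scale = 10.0 / math.log(10.0)  # converts the distance to decibels
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_error = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(2.0 * squared_error)
    return total / len(ref_frames)


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative placeholder frames (two frames, three coefficients each).
    reference = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
    synthesized = [[1.1, 0.4, -0.2], [0.8, 0.5, -0.1]]
    print(f"MCD: {mel_cepstral_distortion(reference, synthesized):.2f} dB")

    # Illustrative placeholder WERs, not results from the paper.
    print(f"Relative WER reduction: {relative_wer_reduction(40.0, 30.0):.2f}%")
```

Note that a relative reduction is measured against the comparison system's WER, so a 54.22% relative reduction means the adapted system's WER is 45.78% of the comparison WER, not that WER fell by 54.22 percentage points.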