🤖 AI Summary
This study addresses the speech intelligibility challenges faced by second-language (L2) learners by proposing the first clear-speech synthesis method customized via vowel duration. Unlike conventional time-stretching approaches, the method extends the Matcha-TTS framework to explicitly model the duration contrast between English tense and lax vowels, enabling prosodically controlled clear speech synthesis through a dedicated “clarity mode.” Crucially, it incorporates L2 learners’ perceptual sensitivity to vowel duration into TTS optimization. Evaluation combines human perception experiments with Whisper-ASR–based automatic assessment. Experiments with French-native English learners show that the clarity mode reduces transcription errors by at least 9.15% and is rated more encouraging and respectful than uniformly slowed speech, yet listeners remained unaware of these gains: they still judged globally slowed speech to be the most intelligible. These results reveal a dissociation between objective intelligibility and subjective perceptual judgments.
📝 Abstract
We present the first text-to-speech (TTS) system tailored to second-language (L2) speakers. We use the duration difference between American English tense (longer) and lax (shorter) vowels to create a "clarity mode" for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners made at least 9.15% fewer transcription errors with our clarity mode, and found it more encouraging and respectful than uniformly slowed-down speech. Remarkably, listeners were unaware of these effects: despite the lower word error rate in clarity mode, they still believed that slowing all target words was the most intelligible, suggesting that actual intelligibility does not correlate with perceived intelligibility. Additionally, we found that Whisper-ASR does not rely on the same cues as L2 speakers to differentiate difficult vowels and is therefore insufficient for assessing the intelligibility of TTS systems for these individuals.