🤖 AI Summary
This study addresses the speech intelligibility challenges faced by second-language (L2) learners by proposing the first clear-speech synthesis method customized via vowel duration. Unlike conventional time-stretching approaches, the method extends the Matcha-TTS framework to explicitly model the duration contrast between English tense and lax vowels, enabling prosodically controlled clear speech synthesis through a dedicated “clarity mode.” Crucially, it incorporates L2 learners’ perceptual sensitivity to vowel duration into TTS optimization. Evaluation combines human perception experiments with Whisper-ASR–based automatic assessment. Experiments with French-native English learners show that the clarity mode reduces transcription errors by at least 9.15% and is rated more encouraging and respectful than uniformly slowed speech, yet listeners remained unaware of these gains: they still judged globally slowed speech to be the most intelligible. These results reveal a dissociation between objective intelligibility and subjective perceptual judgments.
📝 Abstract
We present the first text-to-speech (TTS) system tailored to second-language (L2) speakers. We use the duration difference between American English tense (longer) and lax (shorter) vowels to create a "clarity mode" for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners made at least 9.15% fewer transcription errors with our clarity mode, and found it more encouraging and respectful than uniformly slowed-down speech. Remarkably, listeners were unaware of these effects: despite the lower word error rate in clarity mode, they still believed that slowing all target words was the most intelligible, suggesting that actual intelligibility does not correlate with perceived intelligibility. Additionally, we found that Whisper-ASR does not rely on the same cues as L2 speakers to differentiate difficult vowels and is therefore insufficient for assessing the intelligibility of TTS systems for these individuals.