Probing Human Articulatory Constraints in End-to-End TTS with Reverse and Mismatched Speech-Text Directions

📅 2026-02-16
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether anatomical constraints of the human vocal apparatus influence the training and performance of end-to-end text-to-speech (TTS) systems. To this end, the authors introduce systematic training strategies employing reversed and mismatched text–speech pairs, comparing forward and reversed training configurations on both the Tacotron-2 and VITS architectures. Experimental results show that models trained with reversed text and reversed speech consistently outperform conventional forward-trained systems in speech fidelity, perceptual intelligibility, and naturalness. These findings suggest that e2e-TTS models are essentially data-driven rather than bound by the physiological constraints inherent in human articulation, offering a novel perspective on the role of biological priors in neural speech synthesis.

πŸ“ Abstract
An end-to-end (e2e) text-to-speech (TTS) system is a deep architecture that learns to associate a text string with acoustic speech patterns from a curated dataset. All aspects of speech production, such as phone duration, speaker characteristics, and intonation, are expected to be captured in the trained TTS model so that the synthesized speech is natural and intelligible. Human speech is complex, involving smooth transitions between articulatory configurations (ACs). Due to anatomical constraints, some ACs are challenging to mimic or transition between. In this paper, we experimentally study whether the constraints imposed by human anatomy have implications for training e2e-TTS systems. We experiment with two e2e-TTS architectures: Tacotron-2, an autoregressive model, and VITS-TTS, a non-autoregressive model. In this study, we build TTS systems using (a) forward text, forward speech (conventional e2e-TTS), (b) reverse text, reverse speech (r-e2e-TTS), and (c) reverse text, forward speech (rtfs-e2e-TTS). Experiments demonstrate that e2e-TTS systems are purely data-driven. Interestingly, the speech generated by r-e2e-TTS systems exhibits better fidelity, better perceptual intelligibility, and better naturalness.
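The three training configurations in the abstract can be sketched as a simple data-preparation step. This is a minimal illustration, not the paper's actual pipeline: the function name is hypothetical, the waveform is represented as a plain list of samples, and text is reversed at the character level (the paper may instead reverse at the phoneme or word level).

```python
def make_training_pairs(text, waveform):
    """Build (text, speech) pairs for the three configurations
    described in the abstract. Names are illustrative only."""
    return {
        # (a) conventional e2e-TTS: forward text, forward speech
        "e2e": (text, waveform),
        # (b) r-e2e-TTS: reverse text, reverse speech
        "r-e2e": (text[::-1], waveform[::-1]),
        # (c) rtfs-e2e-TTS: reverse text, forward (unreversed) speech
        "rtfs-e2e": (text[::-1], waveform),
    }

pairs = make_training_pairs("hello", [0.1, 0.2, 0.3])
print(pairs["r-e2e"])      # ('olleh', [0.3, 0.2, 0.1])
print(pairs["rtfs-e2e"])   # ('olleh', [0.1, 0.2, 0.3])
```

The rtfs configuration deliberately mismatches the two directions, which is what lets the paper separate data-driven alignment effects from any articulatory prior.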
Problem

Research questions and friction points this paper is trying to address.

articulatory constraints
end-to-end TTS
speech production
anatomical constraints
text-to-speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

end-to-end TTS
articulatory constraints
reverse speech-text alignment
Tacotron-2
VITS-TTS