🤖 AI Summary
This work addresses the limited syntactic sensitivity of text-to-speech (TTS) systems when predicting prosodic phrase boundaries. The problem is most visible in syntactically ambiguous constructions such as garden-path sentences, where models lean on punctuation instead of latent syntactic cues. To assess TTS models' syntactic awareness, we introduce an evaluation paradigm inspired by psycholinguistics. We also propose a punctuation-agnostic fine-tuning strategy that pushes models to infer implicit syntactic structure. Our methodology integrates controlled fine-tuning of pretrained TTS models, construction of a syntactically annotated prosodic boundary dataset, development of a human-validated prosody labeling protocol, and design of a contrastive ambiguity analysis framework. Results indicate that, for complex sentences, fine-tuned models place prosodic boundaries more consistently with constituent-level syntactic structure and depend less on punctuation. This work provides a methodological framework and empirical evidence for improving TTS naturalness and alignment with linguistic structure.
📝 Abstract
We analyze the syntactic sensitivity of Text-to-Speech (TTS) systems using methods inspired by psycholinguistic research. Specifically, we focus on the generation of intonational phrase boundaries, which can often be predicted by identifying syntactic boundaries within a sentence. We find that TTS systems struggle to accurately generate intonational phrase boundaries in sentences where syntactic boundaries are ambiguous (e.g., garden-path sentences or sentences with attachment ambiguity). In these cases, systems depend on superficial cues such as commas to place boundaries at the correct positions. In contrast, for sentences with simpler syntactic structures, we find that systems do incorporate syntactic cues beyond surface markers. Finally, we fine-tune models on sentences without commas at the syntactic boundary positions, encouraging them to focus on more subtle linguistic cues. Our findings indicate that this leads to more distinct intonation patterns that better reflect the underlying structure.
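The comma-removal step behind the fine-tuning data can be illustrated with a minimal sketch. The abstract does not describe the actual preprocessing, so everything here is an assumption: the function name `strip_boundary_commas` is hypothetical, and treating every comma as a syntactic boundary marker is a simplification. The sketch returns the comma-free text together with the word positions where commas originally stood, so the expected boundary locations remain available as reference labels.

```python
from typing import List, Tuple


def strip_boundary_commas(sentence: str) -> Tuple[str, List[int]]:
    """Remove commas from a sentence (hypothetical helper, not from the paper).

    Returns the comma-free text and the 0-based indices of the words
    after which a comma originally appeared. Assumes whitespace
    tokenization and that each comma marks a syntactic boundary.
    """
    cleaned: List[str] = []
    boundaries: List[int] = []
    for word in sentence.split():
        core = word.rstrip(",")
        if core:
            cleaned.append(core)
        # A comma was stripped: record the boundary after the last kept word.
        if core != word and cleaned:
            boundaries.append(len(cleaned) - 1)
    return " ".join(cleaned), boundaries


text, marks = strip_boundary_commas(
    "While the man hunted, the deer ran into the woods."
)
```

For this garden-path example, `text` no longer contains the disambiguating comma, while `marks` still records that a boundary is expected after the fourth word ("hunted").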