A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity

📅 2025-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the insufficient syntactic sensitivity of text-to-speech (TTS) systems in prosodic phrase boundary prediction—particularly for syntactically ambiguous constructions such as garden-path sentences—where models over-rely on punctuation and neglect latent syntactic cues. To rigorously assess TTS models’ syntactic awareness, we introduce a psycholinguistic evaluation paradigm. We propose a punctuation-agnostic fine-tuning strategy that compels models to infer implicit syntactic structure. Our methodology integrates controlled fine-tuning of pretrained TTS models, construction of a syntactically annotated prosodic boundary dataset, development of a human-validated prosody labeling protocol, and design of a contrastive ambiguity analysis framework. Results demonstrate significantly improved syntactic consistency in prosodic boundary placement for complex sentences: fine-tuned models better reflect constituent-level syntactic structure and markedly reduce punctuation dependency. This work provides both a novel methodological framework and empirical evidence for enhancing TTS naturalness and alignment with linguistic structure.

Technology Category

Application Category

📝 Abstract

We analyze the syntactic sensitivity of Text-to-Speech (TTS) systems using methods inspired by psycholinguistic research. Specifically, we focus on the generation of intonational phrase boundaries, which can often be predicted by identifying syntactic boundaries within a sentence. We find that TTS systems struggle to accurately generate intonational phrase boundaries in sentences where syntactic boundaries are ambiguous (e.g., garden path sentences or sentences with attachment ambiguity). In these cases, systems need superficial cues such as commas to place boundaries at the correct positions. In contrast, for sentences with simpler syntactic structures, we find that systems do incorporate syntactic cues beyond surface markers. Finally, we finetune models on sentences without commas at the syntactic boundary positions, encouraging them to focus on more subtle linguistic cues. Our findings indicate that this leads to more distinct intonation patterns that better reflect the underlying structure.

Problem

Research questions and friction points this paper is trying to address.

Analyzing syntactic sensitivity in TTS intonational phrasing

Identifying TTS struggles with ambiguous syntactic boundaries

Improving intonation patterns via fine-tuning on comma-free sentences

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyze TTS syntactic sensitivity psycholinguistically

Finetune models using syntactic boundary sentences

Enhance intonation patterns with linguistic cues

🔎 Similar Papers

No similar papers found.

Authors to Follow