Adaptive Duration Model for Text Speech Alignment

📅 2025-07-30

📈 Citations: 0

✨ Influential: 0

career value

229K/year

🤖 AI Summary

Neural text-to-speech (TTS) systems suffer from fragile phoneme–text alignment—especially for long sentences and out-of-domain text—leading to word omissions and repetitions. To address this, we propose a conditional adaptive phoneme duration distribution prediction framework. Unlike conventional approaches assuming fixed duration distributions (e.g., Gaussian), our method is the first to model duration distributions conditioned on textual input, enabling context-aware variability in phoneme timing. Integrated into a non-autoregressive end-to-end TTS architecture, it employs an attention-enhanced external duration predictor. Crucially, the framework demonstrates strong zero-shot robustness to audio–text mismatches—e.g., when prompted with inconsistent or noisy audio references. Experiments show a 11.3% absolute improvement in phoneme-level alignment accuracy, substantially mitigating omissions and repetitions. Moreover, the model exhibits significantly enhanced generalization across unseen domains and sentence structures compared to baseline models.

Technology Category

Application Category

📝 Abstract

Speech-to-text alignment is a critical component of neural text to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words. Most non-autoregressive end to-end TTS models rely on durations extracted from external sources, using additional duration models for alignment. In this paper, we propose a novel duration prediction framework that can give compromising phoneme-level duration distribution with given text. In our experiments, the proposed duration model has more precise prediction and condition adaptation ability compared to previous baseline models. Numerically, it has roughly a 11.3 percents immprovement on alignment accuracy, and makes the performance of zero-shot TTS models more robust to the mismatch between prompt audio and input audio.

Problem

Research questions and friction points this paper is trying to address.

Improving speech-to-text alignment in neural TTS models

Addressing brittle alignments in long and out-of-domain utterances

Enhancing duration prediction accuracy for zero-shot TTS robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel phoneme-level duration prediction framework

Precise prediction with condition adaptation

Improves alignment accuracy by 11.3%

🔎 Similar Papers

No similar papers found.

Anthropic

$350,000—$500,000 USD

San Francisco, CA, USA

AI Research Scientist - Meta Superintelligence Labs (PhD)