Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Language model (LM)-based text-to-speech (TTS) systems suffer from pervasive speech hallucinations: spurious or ungrounded acoustic outputs inconsistent with the input text. Method: This paper introduces GFlowNets into TTS decoding for the first time, via a post-training framework that needs neither massive resources nor extra inference cost. By modeling the distribution over generation trajectories, it formulates an enhanced Subtrajectory Balance objective that jointly incorporates a sharpened internal reward signal and dynamic temperature annealing, all without adding inference latency. Contribution/Results: The method requires no auxiliary training resources and effectively suppresses generation bias. Experiments on highly challenging test sets demonstrate an over 50% reduction in character error rate (CER), up to a 58% decrease in model uncertainty, and strong generalization and robustness across diverse speakers and prosodic conditions.

📝 Abstract
Language Model (LM)-based Text-to-Speech (TTS) systems often generate hallucinated speech that deviates from the input text. Existing mitigation strategies either demand excessive training resources or introduce significant inference latency. In this paper, we propose GFlOwNet-guided distribution AlignmenT (GOAT) for LM-based TTS, a post-training framework that mitigates hallucinations without relying on massive resources or inference cost. Specifically, we first conduct an uncertainty analysis, revealing a strong positive correlation between hallucination and model uncertainty. Based on this, we reformulate TTS generation as a trajectory flow optimization problem and introduce an enhanced Subtrajectory Balance objective together with a sharpened internal reward as the target distribution. We further integrate reward temperature decay and learning rate optimization for stability and a performance balance. Extensive experiments show that GOAT reduces character error rates by over 50% on challenging test cases and lowers uncertainty by up to 58%, demonstrating its strong generalization ability and effectiveness.
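The page gives no equations, but the Subtrajectory Balance idea the abstract names has a standard form: for every subtrajectory s_i → s_j, the learned state flow times the forward transition probabilities should match the flow at s_j, with the terminal flow pinned to the (sharpened) reward. A minimal sketch, assuming a deterministic backward policy over token prefixes (each sequence has a unique parent, so log P_B terms vanish) and hypothetical toy numbers rather than the paper's actual objective:

```python
def subtb_loss(log_flows, logpf):
    """Mean squared Subtrajectory Balance residual for one trajectory.

    log_flows[k] is the learned log-flow of the k-th state on the
    trajectory; log_flows[-1] should equal the (sharpened) log-reward.
    logpf[k] is log P_F(s_{k+1} | s_k) for the k-th decoding step.
    Token sequences have a unique parent, so the backward policy is
    deterministic and its log-probability terms drop out.
    """
    n = len(logpf)
    residuals = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            lhs = log_flows[i] + sum(logpf[i:j])  # forward flow over s_i..s_j
            rhs = log_flows[j]                    # backward side (log P_B = 0)
            residuals.append((lhs - rhs) ** 2)
    return sum(residuals) / len(residuals)

# Toy 3-step decoding trajectory with a sharpened terminal reward R**beta.
# All numbers here are hypothetical, for illustration only.
beta = 2.0                  # reward-sharpening exponent
log_reward = -1.0           # log R from some internal scorer
logpf = [-0.5, -0.7, -0.3]  # forward log-probs of the sampled tokens
# Flows consistent with the sharpened reward balance every subtrajectory:
log_flows = [beta * log_reward - sum(logpf[k:]) for k in range(3)]
log_flows.append(beta * log_reward)
print(round(subtb_loss(log_flows, logpf), 12))  # 0.0
```

Minimizing this residual over sampled decoding trajectories pulls the model's sampling distribution toward one proportional to the sharpened reward, which is the distribution-alignment mechanism the summary describes.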
Problem

Research questions and friction points this paper is trying to address.

Mitigating hallucinations in LM-based TTS models
Reducing character error rates without excessive resources
Aligning model distribution to minimize speech deviations
Innovation

Methods, ideas, or system contributions that make the work stand out.

GFlowNet-guided distribution alignment framework
Subtrajectory Balance objective with sharpened reward
Reward temperature decay and learning rate optimization
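The page does not specify the annealing schedule, but "reward temperature decay" can be sketched as a sharpening temperature T that decays over training steps, with the reward used as R**(1/T) so low T concentrates mass on high-reward trajectories. All constants and function names below are hypothetical illustrations, not taken from the paper:

```python
def reward_temperature(step, t0=2.0, t_min=1.0, gamma=0.99):
    """Exponentially decay the reward-sharpening temperature toward t_min.

    Early in training a high T keeps the target distribution flat
    (exploration); as T decays, R**(1/T) sharpens toward high-reward
    trajectories (exploitation). Constants here are hypothetical.
    """
    return max(t_min, t0 * gamma ** step)

def sharpened_log_reward(log_r, step):
    # log of R**(1/T) is simply log_r / T
    return log_r / reward_temperature(step)

print(reward_temperature(0))              # 2.0 at the start
print(sharpened_log_reward(-2.0, 0))      # -1.0 (flattened by T = 2)
```

Clamping at `t_min` keeps the target distribution from collapsing entirely onto the reward mode, which is one plausible reading of the "stability and performance balance" the abstract mentions.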
Chenlin Liu
Harbin Institute of Technology, China
Minghui Fang
Zhejiang University
Speech · Multi-Modal Learning · Information Retrieval
Patrick Zhang
Zhejiang University, China
Wei Zhou
Zhejiang University, China
Jie Gao
Tsinghua University, Shenzhen, China
Jiqing Han
Harbin Institute of Technology, China