Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Language model (LM)-based text-to-speech (TTS) systems suffer from pervasive speech hallucinations: spurious or ungrounded acoustic outputs inconsistent with the input text. Method: This paper introduces GFlowNets into TTS decoding for the first time, via a post-training framework that needs neither massive resources nor extra inference cost. By modeling the distribution over generation trajectories, it formulates an enhanced Subtrajectory Balance objective that jointly incorporates a sharpened internal reward signal and dynamic temperature annealing, all without adding inference latency. Contribution/Results: The method requires no auxiliary training resources and effectively suppresses generation bias. Experiments on highly challenging test sets demonstrate an over 50% reduction in character error rate (CER), up to a 58% decrease in model uncertainty, and strong generalization and robustness across diverse speakers and prosodic conditions.

📝 Abstract
Language Model (LM)-based Text-to-Speech (TTS) systems often generate hallucinated speech that deviates from the input text. Existing mitigation strategies either demand excessive training resources or introduce significant inference latency. In this paper, we propose GFlOwNet-guided distribution AlignmenT (GOAT) for LM-based TTS, a post-training framework that mitigates hallucinations without relying on massive resources or inference cost. Specifically, we first conduct an uncertainty analysis, revealing a strong positive correlation between hallucination and model uncertainty. Based on this, we reformulate TTS generation as a trajectory flow optimization problem and introduce an enhanced Subtrajectory Balance objective together with a sharpened internal reward as the target distribution. We further integrate reward temperature decay and learning rate optimization for stability and a performance balance. Extensive experiments show that GOAT reduces character error rates by over 50% on challenging test cases and lowers uncertainty by up to 58%, demonstrating its strong generalization ability and effectiveness.
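The page gives no equations, but the Subtrajectory Balance idea the abstract names has a standard form: for every subtrajectory s_i → s_j, the learned state flow times the forward transition probabilities should match the flow at s_j, with the terminal flow pinned to the (sharpened) reward. A minimal sketch, assuming a deterministic backward policy over token prefixes (each sequence has a unique parent, so log P_B terms vanish) and hypothetical toy numbers rather than the paper's actual objective:

```python
def subtb_loss(log_flows, logpf):
    """Mean squared Subtrajectory Balance residual for one trajectory.

    log_flows[k] is the learned log-flow of the k-th state on the
    trajectory; log_flows[-1] should equal the (sharpened) log-reward.
    logpf[k] is log P_F(s_{k+1} | s_k) for the k-th decoding step.
    Token sequences have a unique parent, so the backward policy is
    deterministic and its log-probability terms drop out.
    """
    n = len(logpf)
    residuals = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            lhs = log_flows[i] + sum(logpf[i:j])  # forward flow over s_i..s_j
            rhs = log_flows[j]                    # backward side (log P_B = 0)
            residuals.append((lhs - rhs) ** 2)
    return sum(residuals) / len(residuals)

# Toy 3-step decoding trajectory with a sharpened terminal reward R**beta.
# All numbers here are hypothetical, for illustration only.
beta = 2.0                  # reward-sharpening exponent
log_reward = -1.0           # log R from some internal scorer
logpf = [-0.5, -0.7, -0.3]  # forward log-probs of the sampled tokens
# Flows consistent with the sharpened reward balance every subtrajectory:
log_flows = [beta * log_reward - sum(logpf[k:]) for k in range(3)]
log_flows.append(beta * log_reward)
print(round(subtb_loss(log_flows, logpf), 12))  # 0.0
```

Minimizing this residual over sampled decoding trajectories pulls the model's sampling distribution toward one proportional to the sharpened reward, which is the distribution-alignment mechanism the summary describes.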
Problem

Research questions and friction points this paper is trying to address.

Mitigating hallucinations in LM-based TTS models
Reducing character error rates without excessive resources
Aligning model distribution to minimize speech deviations
Innovation

Methods, ideas, or system contributions that make the work stand out.

GFlowNet-guided distribution alignment framework
Subtrajectory Balance objective with sharpened reward
Reward temperature decay and learning rate optimization
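The page does not specify the annealing schedule, but "reward temperature decay" can be sketched as a sharpening temperature T that decays over training steps, with the reward used as R**(1/T) so low T concentrates mass on high-reward trajectories. All constants and function names below are hypothetical illustrations, not taken from the paper:

```python
def reward_temperature(step, t0=2.0, t_min=1.0, gamma=0.99):
    """Exponentially decay the reward-sharpening temperature toward t_min.

    Early in training a high T keeps the target distribution flat
    (exploration); as T decays, R**(1/T) sharpens toward high-reward
    trajectories (exploitation). Constants here are hypothetical.
    """
    return max(t_min, t0 * gamma ** step)

def sharpened_log_reward(log_r, step):
    # log of R**(1/T) is simply log_r / T
    return log_r / reward_temperature(step)

print(reward_temperature(0))              # 2.0 at the start
print(sharpened_log_reward(-2.0, 0))      # -1.0 (flattened by T = 2)
```

Clamping at `t_min` keeps the target distribution from collapsing entirely onto the reward mode, which is one plausible reading of the "stability and performance balance" the abstract mentions.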
Chenlin Liu
Harbin Institute of Technology, China
Minghui Fang
Zhejiang University
Speech · Multi-Modal Learning · Information Retrieval
Patrick Zhang
Zhejiang University, China
Wei Zhou
Zhejiang University, China
Jie Gao
Tsinghua University, Shenzhen, China
Jiqing Han
Harbin Institute of Technology, China