DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of disentangling speaker timbre from background environmental attributes in environment-aware text-to-speech (TTS), this paper proposes DAIEN-TTS, a zero-shot framework based on disentangled audio infilling. Building on the F5-TTS architecture, the method adds a pretrained speech-environment separation module, a random span masking strategy, and a dual classifier-free guidance mechanism coupled with signal-to-noise ratio (SNR) adaptation, enabling independent control over linguistic content, speaker identity, and acoustic environment. Experiments show that the synthesized speech achieves high naturalness, strong speaker similarity, and high environmental fidelity.

📝 Abstract
This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, the proposed DAIEN-TTS first incorporates a pretrained speech-environment separation (SES) module to disentangle the environmental speech into mel-spectrograms of clean speech and environment audio. Two random span masks of varying lengths are then applied to both mel-spectrograms, which, together with the text embedding, serve as conditions for infilling the masked environmental mel-spectrogram, enabling the simultaneous continuation of personalized speech and time-varying environmental audio. To further enhance controllability during inference, we adopt dual classifier-free guidance (DCFG) for the speech and environment components and introduce a signal-to-noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN-TTS generates environment-aware personalized speech with high naturalness, strong speaker similarity, and high environmental fidelity.
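
The random span masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the masking fractions (0.3 to 0.7), the zero fill value, and the toy mel-spectrogram shapes are all assumptions chosen for illustration. It only shows the idea of drawing two independent contiguous masks, one for the clean-speech mel-spectrogram and one for the environment mel-spectrogram.

```python
import random

def random_span_mask(num_frames, min_frac, max_frac, rng):
    """Return a list of booleans with one contiguous True (masked) span.

    The span length is drawn as a random fraction of the sequence length.
    min_frac and max_frac are hypothetical hyperparameters.
    """
    span_len = int(num_frames * rng.uniform(min_frac, max_frac))
    span_len = max(1, min(span_len, num_frames))
    start = rng.randint(0, num_frames - span_len)  # inclusive bounds
    return [start <= t < start + span_len for t in range(num_frames)]

def apply_mask(mel, mask):
    """Zero out masked frames. mel is a list of frames (lists of floats)."""
    return [[0.0] * len(frame) if m else list(frame)
            for frame, m in zip(mel, mask)]

rng = random.Random(0)
num_frames, num_bins = 20, 4

# Toy stand-ins for the outputs of a speech-environment separation module.
speech_mel = [[1.0] * num_bins for _ in range(num_frames)]
env_mel = [[0.5] * num_bins for _ in range(num_frames)]

# Two masks are sampled independently, so their lengths generally differ.
speech_mask = random_span_mask(num_frames, 0.3, 0.7, rng)
env_mask = random_span_mask(num_frames, 0.3, 0.7, rng)

masked_speech = apply_mask(speech_mel, speech_mask)
masked_env = apply_mask(env_mel, env_mask)
```

In a training setup of this kind, the unmasked frames would act as the speaker and environment prompts, and the model would be asked to infill the masked region conditioned on them and on the text embedding.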

Problem

Research questions and friction points this paper is trying to address.

Enabling independent control of speaker timbre and background environment
Disentangling environmental speech into clean speech and environment audio
Generating environmental personalized speech with high fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled audio infilling for environment control
Dual classifier-free guidance (DCFG) for enhanced controllability
Signal-to-noise ratio (SNR) adaptation to align synthesized speech with the environment prompt
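
The last two items can be sketched in a few lines. The page does not give the paper's exact guidance formula or the details of its SNR adaptation, so everything below is an assumption: `dual_cfg` uses a generic additive form of classifier-free guidance with two separate weights, and `scale_to_snr` uses the standard power-ratio definition of SNR to rescale speech against an environment signal. All function and parameter names are hypothetical.

```python
import math

def dual_cfg(v_uncond, v_speech, v_env, w_speech, w_env):
    """Combine three model predictions with two guidance weights.

    Generic additive form (an assumption, not the paper's formula):
        v = v_uncond + w_speech * (v_speech - v_uncond)
                     + w_env    * (v_env    - v_uncond)
    """
    return [u + w_speech * (s - u) + w_env * (e - u)
            for u, s, e in zip(v_uncond, v_speech, v_env)]

def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)

def snr_db(speech, env):
    """Signal-to-noise ratio in dB, treating env as the noise."""
    return 10.0 * math.log10(power(speech) / power(env))

def scale_to_snr(speech, env, target_snr_db):
    """Rescale speech so that its SNR against env equals target_snr_db."""
    gain = math.sqrt(power(env) * 10.0 ** (target_snr_db / 10.0) / power(speech))
    return [gain * x for x in speech]

# Toy signals standing in for separated speech and environment audio.
speech = [1.0, -1.0, 1.0, -1.0]
env = [0.1, -0.1, 0.1, -0.1]

prompt_snr = snr_db(speech, env)            # SNR measured on a prompt
matched = scale_to_snr(speech, env, 5.0)    # speech rescaled to 5 dB SNR
guided = dual_cfg([0.0, 0.0], [1.0, 2.0], [3.0, 4.0], 1.0, 0.0)
```

Separate weights let the speaker and environment conditions be strengthened or weakened independently at inference time, which is the kind of controllability the list above refers to.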
Ye-Xin Lu
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China
Yu Gu
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China
Kun Wei
School of Computer Science, Northwestern Polytechnical University
deep learning, computer science, speech
Hui-Peng Du
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China
Yang Ai
Associate Researcher, University of Science and Technology of China
Speech Synthesis, Speech Enhancement, Speech Coding, Deep Learning
Zhen-Hua Ling
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China