VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

📅 2024-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address degraded speech-text alignment and environmental sound mismatch in text-to-speech (TTS) synthesis under noisy conditions, this paper proposes an environment-aware multi-modal speech synthesis framework. Methodologically, the authors introduce a dual-condition diffusion Transformer architecture that jointly models phonetic content and environmental acoustics; design a diffusion-based image-to-audio translation mechanism that leverages visual cues to guide environmental sound generation; and combine pre-training on a large-scale synthetic speech dataset with fine-tuning on refined real-world data, augmented by cross-modal alignment between image and audio representations. Experiments on real-world datasets demonstrate significant improvements: mean opinion score (MOS) increases by 1.2, speech naturalness and multi-modal consistency are enhanced, and both environmental fidelity and overall speech quality surpass existing state-of-the-art methods.
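
The dual-conditioning idea can be sketched in PyTorch. This is a minimal illustration under common DiT conventions, not the authors' released code: the class name `DualConditionDiTBlock`, the adaLN-style modulation, and the choice to inject phonetic content via cross-attention while the environment embedding modulates each block are all assumptions.

```python
import torch
import torch.nn as nn

class DualConditionDiTBlock(nn.Module):
    """Hypothetical DiT block with two conditions: text (via cross-attention)
    and environment + diffusion timestep (via adaLN modulation)."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Maps the fused (timestep + environment) condition to scale/shift/gate
        # parameters for the two modulated sub-layers.
        self.ada_ln = nn.Linear(dim, 6 * dim)

    def forward(self, x, text_emb, env_emb, t_emb):
        # x: (B, T, dim) noisy latents; text_emb: (B, L, dim);
        # env_emb, t_emb: (B, dim) global conditions.
        cond = t_emb + env_emb
        s1, b1, g1, s2, b2, g2 = self.ada_ln(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.self_attn(h, h, h)[0]
        # Phonetic content enters through cross-attention to the text embeddings.
        x = x + self.cross_attn(self.norm2(x), text_emb, text_emb)[0]
        h = self.norm3(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)
```

Stacking such blocks over noisy mel-spectrogram latents and decoding the final output with a vocoder would mirror a typical latent-diffusion TTS pipeline, though the paper's exact layer arrangement may differ.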

📝 Abstract
We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts. Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real-world datasets, showcasing significant improvements in both audio quality and modality integration.
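
The abstract's diffusion-based Image-to-Audio Translator can likewise be sketched as a small denoiser that maps an image embedding to an audio-prompt embedding, in the spirit of diffusion "prior" models. Everything below (the class name, dimensions, the x0-prediction objective, and the toy noise schedule) is an assumption for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ImageToAudioTranslator(nn.Module):
    """Predicts the clean audio embedding from a noisy one plus image context."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.net = nn.Sequential(
            nn.Linear(3 * dim, 2 * dim), nn.SiLU(),
            nn.Linear(2 * dim, dim),
        )

    def forward(self, noisy_audio_emb, image_emb, t):
        t_emb = self.time_mlp(t.float().unsqueeze(-1))
        return self.net(torch.cat([noisy_audio_emb, image_emb, t_emb], dim=-1))

# One training step on paired image/audio embeddings (x0-prediction objective):
model = ImageToAudioTranslator()
audio_emb = torch.randn(8, 512)   # e.g., a CLAP-style audio embedding
image_emb = torch.randn(8, 512)   # e.g., a CLIP image embedding
t = torch.randint(0, 1000, (8,))
alpha = (1 - t.float() / 1000).sqrt().unsqueeze(-1)   # toy noise schedule
noisy = alpha * audio_emb + (1 - alpha**2).sqrt() * torch.randn_like(audio_emb)
loss = nn.functional.mse_loss(model(noisy, image_emb, t), audio_emb)
loss.backward()
```

At inference time, such a translator would iteratively denoise a random vector conditioned on the image embedding, and the resulting audio-prompt embedding would serve as the environmental condition for the Dual-DiT.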
Problem

Research questions and friction points this paper is trying to address.

Speech Synthesis
Environmental Adaptability
Audio Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

VoiceDiT
Dual-DiT
Image-to-Audio Translator