🤖 AI Summary
To address degraded speech-text alignment and mismatched environmental sound in text-to-speech (TTS) synthesis under noisy conditions, this paper proposes an environment-aware multimodal speech synthesis framework. Methodologically, it introduces a dual-conditional diffusion Transformer architecture that jointly models phonetic content and ambient acoustic conditions, designs a cross-modal image-to-audio translation mechanism that leverages visual cues to guide environmental sound generation, and combines synthetic-data pretraining with real-data fine-tuning, augmented by cross-modal alignment representation learning. Experiments on real-world datasets demonstrate significant improvements: mean opinion score (MOS) increases by 1.2, speech naturalness and multimodal consistency improve, and both environmental fidelity and overall speech quality surpass existing state-of-the-art methods.
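The dual-conditional design described above can be pictured as a diffusion Transformer block with two cross-attention paths: one attending to the phoneme/text condition and one to the environment condition. The PyTorch sketch below illustrates this idea under assumed names and dimensions (`DualDiTBlock`, `txt_dim`, `env_dim`); it is not the authors' implementation, and the paper's actual conditioning mechanism may differ (for example, adaptive layer normalization instead of cross-attention).

```python
# Minimal sketch of a dual-conditional diffusion Transformer block.
# Assumption: the model conditions each block on (a) an aligned phoneme/text
# stream and (b) an environment embedding. All names and sizes are illustrative.
import torch
import torch.nn as nn

class DualDiTBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, txt_dim=512, env_dim=512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Separate cross-attention paths for the two conditions.
        self.txt_attn = nn.MultiheadAttention(dim, n_heads, kdim=txt_dim,
                                              vdim=txt_dim, batch_first=True)
        self.env_attn = nn.MultiheadAttention(dim, n_heads, kdim=env_dim,
                                              vdim=env_dim, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, txt_cond, env_cond):
        # x:        (B, T, dim)      noisy latent audio tokens at this diffusion step
        # txt_cond: (B, L, txt_dim)  aligned phoneme/text features
        # env_cond: (B, M, env_dim)  environment features (e.g. derived from an image)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.txt_attn(h, txt_cond, txt_cond, need_weights=False)[0]
        h = self.norm3(x)
        x = x + self.env_attn(h, env_cond, env_cond, need_weights=False)[0]
        x = x + self.mlp(self.norm4(x))
        return x
```

Keeping the two conditions on separate attention paths is only one way to realize dual conditioning; concatenating the conditions or injecting them through modulation layers would serve the same purpose, and the sketch makes the two streams explicit purely for readability.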
📝 Abstract
We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible synthesis, achieving this alignment in noisy conditions remains a significant and underexplored challenge. To address this, VoiceDiT is designed as an audio generation pipeline with three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that bridges the gap between audio and image, enabling the generation of environmental sound that aligns with the multi-modal prompts. Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real-world datasets, showing significant improvements in both audio quality and modality integration.
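To make the Image-to-Audio Translator concrete, the toy sketch below denoises an audio-side environment embedding conditioned on a CLIP-style image embedding, so that a backbone trained on audio-side conditions can accept image prompts. Every name, dimension, schedule, and the crude sampler are illustrative assumptions, not the paper's actual translator.

```python
# Hedged sketch of a diffusion-based image-to-audio embedding translator.
# Assumption: both modalities are represented as fixed-size embeddings, and a
# small denoiser maps an image embedding to an environment (audio-side) embedding.
import math
import torch
import torch.nn as nn

class EmbeddingTranslator(nn.Module):
    def __init__(self, emb_dim=512, hidden=1024):
        super().__init__()
        # Predicts the clean audio-side embedding from a noisy embedding, the
        # image embedding, and the diffusion timestep (all concatenated).
        self.net = nn.Sequential(
            nn.Linear(emb_dim * 2 + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, noisy_audio_emb, image_emb, t):
        # noisy_audio_emb, image_emb: (B, emb_dim); t: (B, 1) timestep in [0, 1]
        return self.net(torch.cat([noisy_audio_emb, image_emb, t], dim=-1))

def sample_env_embedding(model, image_emb, steps=50):
    """Toy sampler: repeatedly predict the clean embedding and re-noise it at a
    decreasing noise level. Illustrative only, not a faithful DDPM/DDIM sampler."""
    x = torch.randn_like(image_emb)
    for i in reversed(range(steps)):
        t = torch.full((image_emb.size(0), 1), i / steps, device=image_emb.device)
        alpha = math.cos(0.5 * math.pi * i / steps) ** 2   # assumed cosine schedule
        x0_pred = model(x, image_emb, t)
        x = alpha ** 0.5 * x0_pred + (1 - alpha) ** 0.5 * torch.randn_like(x0_pred)
    return x0_pred
```

In use, the sampled environment embedding would stand in for the audio-side condition expected by the generator (for example, the `env_cond` input of the dual-conditional block sketched earlier), which is what lets an image prompt steer the generated environmental sound.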