🤖 AI Summary
Current text-to-speech (TTS) systems prioritize speech naturalness but lack spatial awareness, limiting their applicability in immersive 3D interactive environments such as gaming and VR. To address this, we propose the first image-conditioned TTS framework. Our method introduces a visual scene encoder that maps environmental semantics into guidance signals for speech synthesis. We further design a novel reverberation classification module coupled with mel-spectrogram adaptive refinement, enabling precise alignment between acoustic spatial attributes (such as source distance, wall reflections, and azimuth) and the input visual scene. By fusing multimodal features, our approach preserves high speech naturalness while significantly enhancing spatial consistency and immersion within virtual environments. Extensive experiments demonstrate strong contextual awareness and spatial adaptability of our framework in 3D interactive scenarios.
📝 Abstract
Controlling the style and characteristics of synthesized speech is crucial for adapting the output to specific contexts and user requirements. Previous text-to-speech (TTS) work has focused primarily on the technical aspects of producing natural-sounding speech, such as intonation, rhythm, and clarity, overlooking the growing emphasis on spatial perception of synthesized speech, which can provide an immersive experience in gaming and virtual reality. To address this issue, we present a novel multi-modal TTS approach, Image-indicated Immersive Text-to-speech Synthesis (I2TTS). Specifically, we introduce a scene prompt encoder that integrates visual scene prompts directly into the synthesis pipeline to control the speech generation process. Additionally, we propose a reverberation classification and refinement technique that adjusts the synthesized mel-spectrogram so that the applied reverberation condition accurately matches the scene, enhancing the immersive experience. Experimental results demonstrate that our model achieves high-quality scene and spatial matching without compromising speech naturalness, marking a significant advancement in context-aware speech synthesis.
Project demo page: https://spatialTTS.github.io/
Index Terms: speech synthesis, scene prompt, spatial perception
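To make the conditioning idea concrete, here is a minimal NumPy sketch of the two pieces the abstract names: a scene prompt encoder whose embedding is fused with the text encoding, and a refinement step that reshapes the mel-spectrogram according to a reverberation class. All dimensions, the additive fusion, and the decay-based refinement are illustrative assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper), chosen only for illustration.
TEXT_DIM, SCENE_DIM, HIDDEN, N_MELS, N_FRAMES = 16, 8, 32, 80, 50

# Toy "scene prompt encoder": a single projection that maps an image
# feature vector into the same hidden space as the text encoding.
W_scene = rng.standard_normal((SCENE_DIM, HIDDEN)) * 0.1
W_text = rng.standard_normal((TEXT_DIM, HIDDEN)) * 0.1

def encode_scene(img_feat):
    """Map an image feature vector to a scene embedding, shape (HIDDEN,)."""
    return np.tanh(img_feat @ W_scene)

def fuse(text_feats, scene_emb):
    """Additive conditioning: broadcast the scene embedding over all
    text frames, shape (N_FRAMES, HIDDEN)."""
    return text_feats @ W_text + scene_emb

def refine_mel(mel, reverb_class, n_classes=4):
    """Stand-in for reverberation refinement: attenuate the spectrogram
    over time according to a predicted reverb class (0 = dry room)."""
    end_gain = 1.0 - 0.5 * reverb_class / (n_classes - 1)
    decay = np.linspace(1.0, end_gain, mel.shape[1])
    return mel * decay[None, :]

img_feat = rng.standard_normal(SCENE_DIM)
text_feats = rng.standard_normal((N_FRAMES, TEXT_DIM))
scene_emb = encode_scene(img_feat)
fused = fuse(text_feats, scene_emb)            # conditioned encoder output
mel = rng.standard_normal((N_MELS, N_FRAMES))  # pretend decoder output
refined = refine_mel(mel, reverb_class=3)      # strongest attenuation class
```

In a real model the projections would be learned networks and the refinement would be driven by the classifier's output; the sketch only shows where the scene information enters the pipeline and where the spectrogram is adjusted.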