🤖 AI Summary
This paper tackles three key challenges in face-driven text-to-speech (TTS): the low audio quality of audio-visual corpora, the difficulty of vocalizing artistic portraits, and inconsistent generation under the one-to-many face-to-voice mapping. To address these, the authors propose a multimodal controllable TTS framework. Methodologically, they introduce a two-stage training paradigm that jointly leverages audio-visual and high-quality audio-only data; augment input face images with stylization so that both realistic and artistic portraits can be modeled; and combine sampling-based decoding with prompting on generated speech samples to balance diversity and consistency. Experiments validate the model's effectiveness in speech naturalness, fine-grained text-based controllability (e.g., pace, noise level, speaker distance, tone), and generalization across photorealistic and artistic portraits.
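To make the training recipe concrete, here is a minimal PyTorch-style sketch of how the two corpora could divide the work. The stub networks (`spk_enc`, `face_enc`, `tts_dec`), the feature dimensions, and the alignment objective (regressing face embeddings onto audio-derived speaker embeddings) are all hypothetical placeholders for illustration; the summary does not specify the paper's actual architecture or losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stub networks standing in for the paper's (unspecified) components.
spk_enc = nn.Linear(80, 256)        # reference-audio features -> speaker embedding
face_enc = nn.Linear(512, 256)      # face-image features      -> speaker embedding
tts_dec = nn.Linear(300 + 256, 80)  # (text feats, speaker)    -> mel frame

opt_tts = torch.optim.Adam(list(tts_dec.parameters()) + list(spk_enc.parameters()))
opt_face = torch.optim.Adam(face_enc.parameters())

def audio_only_step(text_feat, mel, ref_audio_feat):
    """Train the TTS decoder on a high-quality audio-only corpus,
    conditioned on speaker embeddings extracted from reference audio."""
    spk = spk_enc(ref_audio_feat)
    pred = tts_dec(torch.cat([text_feat, spk], dim=-1))
    loss = F.l1_loss(pred, mel)
    opt_tts.zero_grad(); loss.backward(); opt_tts.step()
    return loss.item()

def audio_visual_step(face_feat, ref_audio_feat):
    """Align face embeddings with the audio-derived speaker space using
    the audio-visual corpus, so a face can replace reference audio."""
    with torch.no_grad():
        target = spk_enc(ref_audio_feat)  # speaker space held fixed here
    loss = F.mse_loss(face_enc(face_feat), target)
    opt_face.zero_grad(); loss.backward(); opt_face.step()
    return loss.item()
```

The intuition behind this split is that the clean audio-only corpus carries the acoustic-quality burden, while the noisier audio-visual corpus is only asked to teach the face-to-speaker mapping.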
📝 Abstract
This paper explores multi-modal controllable Text-to-Speech (TTS) synthesis, where the voice is generated from a face image and the characteristics of the output speech (e.g., pace, noise level, distance, tone, place) can be controlled with a natural-language text description. Specifically, we aim to mitigate the following three challenges in face-driven TTS systems. 1) To overcome the limited audio quality of audio-visual speech corpora, we propose a training method that additionally utilizes high-quality audio-only speech corpora. 2) To generate voices not only from real human faces but also from artistic portraits, we propose augmenting the input face image with stylization. 3) To account for the one-to-many nature of face-to-voice mapping while ensuring consistent voice generation, we propose to first employ sampling-based decoding and then use prompting with the generated speech samples. Experimental results validate the proposed model's effectiveness in face-driven voice synthesis.
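As an illustration of point 3, the sketch below shows one way sampling-based decoding and speech prompting could compose. `model.sample` and `model.synthesize` are hypothetical interfaces and the candidate-selection step is a placeholder, since the abstract specifies neither.

```python
def synthesize_with_consistency(model, face_image, texts, n_samples=5):
    """Sketch of the two-step decoding idea: sample several plausible
    voices for one face, keep one, then prompt the remaining utterances
    with it so the speaker identity stays fixed.

    `model.sample` and `model.synthesize` are hypothetical interfaces,
    not the paper's actual API.
    """
    # 1) Sampling-based decoding: the face-to-voice mapping is one-to-many,
    #    so several distinct but plausible voices can be drawn per face.
    candidates = [model.sample(face_image, texts[0]) for _ in range(n_samples)]
    anchor = candidates[0]  # pick one candidate (any heuristic could apply)

    # 2) Prompting with the generated sample: later utterances are
    #    conditioned on the chosen speech, keeping the voice consistent.
    outputs = [anchor]
    for text in texts[1:]:
        outputs.append(model.synthesize(text, speech_prompt=anchor))
    return outputs
```

Sampling alone would yield a different voice on every call; anchoring subsequent generations to one sampled utterance is what trades that diversity for consistency.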