🤖 AI Summary
This paper tackles three key challenges in face-driven text-to-speech (TTS): the low audio quality of audio-visual corpora, the difficulty of vocalizing artistic portraits, and inconsistent generation under the one-to-many face-to-voice mapping. To address these, the authors propose a multimodal controllable TTS framework. Methodologically, they introduce a two-stage training paradigm that jointly leverages audio-visual and high-quality audio-only data; augment input face images with stylization so that both realistic and artistic portraits can be modeled; and combine sampling-based decoding with prompting on generated speech samples to balance diversity and consistency. Experiments validate the model's effectiveness in speech naturalness, fine-grained text-based controllability (e.g., pace, noise level, speaker distance, tone), and generalization across photorealistic and artistic portraits.
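To make the training recipe concrete, here is a minimal PyTorch-style sketch of how the two corpora could divide the work. The stub networks (`spk_enc`, `face_enc`, `tts_dec`), the feature dimensions, and the alignment objective (regressing face embeddings onto audio-derived speaker embeddings) are all hypothetical placeholders for illustration; the summary does not specify the paper's actual architecture or losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stub networks standing in for the paper's (unspecified) components.
spk_enc = nn.Linear(80, 256)        # reference-audio features -> speaker embedding
face_enc = nn.Linear(512, 256)      # face-image features      -> speaker embedding
tts_dec = nn.Linear(300 + 256, 80)  # (text feats, speaker)    -> mel frame

opt_tts = torch.optim.Adam(list(tts_dec.parameters()) + list(spk_enc.parameters()))
opt_face = torch.optim.Adam(face_enc.parameters())

def audio_only_step(text_feat, mel, ref_audio_feat):
    """Train the TTS decoder on a high-quality audio-only corpus,
    conditioned on speaker embeddings extracted from reference audio."""
    spk = spk_enc(ref_audio_feat)
    pred = tts_dec(torch.cat([text_feat, spk], dim=-1))
    loss = F.l1_loss(pred, mel)
    opt_tts.zero_grad(); loss.backward(); opt_tts.step()
    return loss.item()

def audio_visual_step(face_feat, ref_audio_feat):
    """Align face embeddings with the audio-derived speaker space using
    the audio-visual corpus, so a face can replace reference audio."""
    with torch.no_grad():
        target = spk_enc(ref_audio_feat)  # speaker space held fixed here
    loss = F.mse_loss(face_enc(face_feat), target)
    opt_face.zero_grad(); loss.backward(); opt_face.step()
    return loss.item()
```

The intuition behind this split is that the clean audio-only corpus carries the acoustic-quality burden, while the noisier audio-visual corpus is only asked to teach the face-to-speaker mapping.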
📝 Abstract
This paper explores multi-modal controllable Text-to-Speech (TTS) synthesis, where the voice is generated from a face image and the characteristics of the output speech (e.g., pace, noise level, distance, tone, place) can be controlled with a natural-language text description. Specifically, we aim to mitigate the following three challenges in face-driven TTS systems. 1) To overcome the limited audio quality of audio-visual speech corpora, we propose a training method that additionally utilizes high-quality audio-only speech corpora. 2) To generate voices not only from real human faces but also from artistic portraits, we propose augmenting the input face image with stylization. 3) To account for the one-to-many nature of face-to-voice mapping while ensuring consistent voice generation, we propose to first employ sampling-based decoding and then use prompting with the generated speech samples. Experimental results validate the proposed model's effectiveness in face-driven voice synthesis.
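As an illustration of point 3, the sketch below shows one way sampling-based decoding and speech prompting could compose. `model.sample` and `model.synthesize` are hypothetical interfaces and the candidate-selection step is a placeholder, since the abstract specifies neither.

```python
def synthesize_with_consistency(model, face_image, texts, n_samples=5):
    """Sketch of the two-step decoding idea: sample several plausible
    voices for one face, keep one, then prompt the remaining utterances
    with it so the speaker identity stays fixed.

    `model.sample` and `model.synthesize` are hypothetical interfaces,
    not the paper's actual API.
    """
    # 1) Sampling-based decoding: the face-to-voice mapping is one-to-many,
    #    so several distinct but plausible voices can be drawn per face.
    candidates = [model.sample(face_image, texts[0]) for _ in range(n_samples)]
    anchor = candidates[0]  # pick one candidate (any heuristic could apply)

    # 2) Prompting with the generated sample: later utterances are
    #    conditioned on the chosen speech, keeping the voice consistent.
    outputs = [anchor]
    for text in texts[1:]:
        outputs.append(model.synthesize(text, speech_prompt=anchor))
    return outputs
```

Sampling alone would yield a different voice on every call; anchoring subsequent generations to one sampled utterance is what trades that diversity for consistency.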