From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech

📅 2025-03-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the profound semantic gap between visual and acoustic modalities in silent-video-to-speech (V2S) synthesis. We propose a three-level hierarchical cross-modal alignment framework that explicitly aligns lip movements, speaker identity, and facial dynamics with speech content, timbre, and prosody, respectively. To model the continuous transformation of speech distributions, we introduce flow matching—a first for V2S—combined with vision-acoustic joint embedding and temporal consistency constraints. Evaluated on LRS3 and other benchmarks, our method achieves a Mean Opinion Score (MOS) of 4.12 and a Word Error Rate (WER) of 12.3%, significantly outperforming prior state-of-the-art methods in timbre and prosody similarity. To our knowledge, this is the first end-to-end V2S approach to simultaneously advance naturalness, intelligibility, and speaker fidelity, establishing new state-of-the-art performance.

Technology Category

Application Category

📝 Abstract
The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning of hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages -- content, timbre, and prosody modeling. In each stage, we align visual factors -- lip movements, face identity, and facial expressions -- with corresponding acoustic counterparts to ensure the seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.
Problem

Research questions and friction points this paper is trying to address.

Bridge modality gap between silent video and speech
Generate high-quality speech from talking face videos
Align visual factors with acoustic counterparts for synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical video-to-speech transformation stages
Visual-acoustic alignment for seamless conversion
Flow matching model for realistic speech generation
🔎 Similar Papers
No similar papers found.