Audio-Visual Cross-Modal Compression for Generative Face Video Coding

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing generative facial video coding (GFVC) methods overlook the substantial impact of audio on bitrate and fail to systematically model cross-modal audio-visual correlations. To address this, we propose the first unified audio-visual generative framework tailored for ultra-low-bitrate multimodal compression. Our method integrates video motion modeling, audio feature tokenization, cross-modal diffusion-based alignment, and shared latent representation learning—enabling synchronous reconstruction of both modalities from a single latent variable while supporting inter-modal conditional generation. This approach breaks away from conventional unimodal compression paradigms. Experimental results demonstrate significant rate-distortion improvements over VVC and state-of-the-art GFVC methods, establishing a new benchmark for efficient joint audio-visual compression.

📝 Abstract
Generative face video coding (GFVC) is vital for modern applications like video conferencing, yet existing methods primarily focus on video motion while neglecting the significant bitrate contribution of audio. Despite the well-established correlation between audio and lip movements, this cross-modal coherence has not been systematically exploited for compression. To address this, we propose an Audio-Visual Cross-Modal Compression (AVCC) framework that jointly compresses audio and video streams. Our framework extracts motion information from video and tokenizes audio features, then aligns them through a unified audio-video diffusion process. This allows synchronized reconstruction of both modalities from a shared representation. In extremely low-rate scenarios, AVCC can even reconstruct one modality from the other. Experiments show that AVCC significantly outperforms the Versatile Video Coding (VVC) standard and state-of-the-art GFVC schemes in rate-distortion performance, paving the way for more efficient multimodal communication systems.
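The abstract's pipeline — extract motion from video, tokenize audio, pack both into a shared representation, then reconstruct each modality from it — can be caricatured in a few lines. This is a minimal sketch under loud assumptions: the function names, the scalar audio quantizer, and the tuple "latent" are all illustrative stand-ins; the paper's actual cross-modal alignment uses a unified audio-video diffusion process, which is not reproduced here.

```python
# Toy sketch of the AVCC-style flow described in the abstract.
# Assumptions (not from the paper): frames are scalars, motion is a
# difference from a base frame, audio tokens are uniform quantization
# levels, and the "shared latent" is just a tuple of both streams.

def extract_motion(video_frames):
    """Toy motion features: per-frame offset from the first frame."""
    base = video_frames[0]
    return [frame - base for frame in video_frames]

def tokenize_audio(audio_samples, levels=16):
    """Toy audio tokens: uniform scalar quantization of samples in [0, 1]."""
    return [round(s * (levels - 1)) for s in audio_samples]

def encode_shared_latent(motion, tokens):
    """Stand-in for cross-modal alignment: pack both streams into one
    latent (the paper aligns them with a diffusion process instead)."""
    return (tuple(motion), tuple(tokens))

def decode(latent, base_frame=0.0, levels=16):
    """Reconstruct both modalities from the single shared latent.
    Assumes the base frame is sent separately, like a keyframe."""
    motion, tokens = latent
    video = [base_frame + m for m in motion]
    audio = [t / (levels - 1) for t in tokens]
    return video, audio
```

The point of the sketch is the data flow, not the compression: one latent object is the only thing "transmitted", and both the video and the audio come back out of it, which is the property the abstract claims for AVCC at ultra-low bitrates.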
Problem

Research questions and friction points this paper is trying to address.

Existing GFVC methods model video motion but neglect audio's substantial contribution to bitrate
The well-established correlation between audio and lip movements has not been systematically exploited for compression
Ultra-low-bitrate scenarios lack a way to reconstruct one modality from the other
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint audio-video compression with cross-modal alignment
Unified diffusion process for synchronized reconstruction
Cross-modal reconstruction in low-bitrate scenarios
Youmin Xu
School of Electronic and Computer Engineering, Peking University
Mengxi Guo
Multimedia Lab, Bytedance Inc. - Peking University
Computer vision, Video codec, Image compression, Image processing
Shijie Zhao
Bytedance Inc., Shenzhen, China & San Diego, CA, USA
Weiqi Li
School of Electronic and Computer Engineering, Peking University
Junlin Li
ByteDance Inc. - Georgia Institute of Technology - Tsinghua University
Video Compression and Processing, Video Streaming, Machine Learning, AI, ASIC Design
Li Zhang
Bytedance Inc., Shenzhen, China & San Diego, CA, USA
Jian Zhang
School of Electronic and Computer Engineering, Peking University