JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model

📅 2024-08-03

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Existing 3D talking-head generation methods struggle to achieve lip-sync accuracy, facial expressiveness, natural head pose dynamics, and high video fidelity at the same time. To address this, we propose the first speech-driven 3D talking-head generation framework built on the Jamba architecture, a hybrid of Transformer and Mamba whose mixed blocks combine Structured State Space Model (SSM) layers with self-attention for efficient long-range temporal modeling. This design preserves computational efficiency while improving motion diversity and temporal coherence. Our multimodal fusion approach achieves performance comparable or superior to state-of-the-art models on quantitative and qualitative measures, including lip-sync error (LSE), expression fidelity, head pose naturalness, and video sharpness. Notably, it accelerates inference by 2.1× over prior methods and yields more temporally coherent animations with richer geometric and textural detail.

📝 Abstract

In recent years, talking head generation has become a focal point for researchers. Considerable effort is being made to refine lip-sync motion, capture expressive facial expressions, generate natural head poses, and achieve high video quality. However, no single model has yet performed equally well across all of these criteria. This paper aims to animate a 3D face using Jamba, a hybrid Transformer-Mamba model. Mamba, a pioneering Structured State Space Model (SSM) architecture, was designed to address the constraints of the conventional Transformer architecture. Nevertheless, Mamba has drawbacks of its own. Jamba merges the advantages of both the Transformer and Mamba approaches, providing a holistic solution. Based on the foundational Jamba block, we present JambaTalk to enhance motion variety and speed through multimodal integration. Extensive experiments reveal that our method achieves performance comparable or superior to state-of-the-art models.

Problem

Research questions and friction points this paper is trying to address.

Generating realistic 3D talking heads from speech

Overcoming limitations in handling long sequences

Improving lip synchronization and motion variety

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Transformer-Mamba model for animation

Multimodal integration enhances motion variety

Combines Transformer self-attention with Mamba SSM layers to model long sequences
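
To make the hybrid idea above concrete, here is a minimal pure-Python sketch of a stack that interleaves state-space layers with self-attention layers. It is an illustration under stated assumptions, not the JambaTalk implementation: the names `ssm_scan`, `self_attention`, and `hybrid_stack`, the layer layout, the causal masking, and the fixed `decay` and `gain` values are all hypothetical, and the recurrence is a plain linear SSM standing in for Mamba's input-dependent (selective) SSM.

```python
import math


def ssm_scan(xs, decay=0.9, gain=0.1):
    """Linear state-space recurrence: h_t = decay * h_{t-1} + gain * x_t.

    Simplified stand-in for a Mamba layer (assumption: fixed parameters,
    whereas Mamba makes them depend on the input). Cost is linear in length.
    """
    h = [0.0] * len(xs[0])
    ys = []
    for x in xs:
        h = [decay * hi + gain * xi for hi, xi in zip(h, x)]
        ys.append(h)
    return ys


def self_attention(xs):
    """Causal single-head dot-product attention with identity projections.

    Assumption: no learned query/key/value weights, to keep the sketch short.
    Cost is quadratic in sequence length.
    """
    dim = len(xs[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for t, q in enumerate(xs):
        scores = [scale * sum(a * b for a, b in zip(q, xs[s]))
                  for s in range(t + 1)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * xs[s][d] for s, w in enumerate(weights)) / total
                    for d in range(dim)])
    return out


def hybrid_stack(xs, layout=("ssm", "ssm", "attn", "ssm")):
    """Apply the layers named in `layout` in order, each with a residual add.

    The layout is a hypothetical choice; the real ratio of attention to
    Mamba layers is a design parameter not stated in this summary.
    """
    mixers = {"ssm": ssm_scan, "attn": self_attention}
    for kind in layout:
        mixed = mixers[kind](xs)
        xs = [[a + b for a, b in zip(x, y)] for x, y in zip(xs, mixed)]
    return xs


# Toy "audio feature" sequence: 6 frames with 3 channels each.
frames = [[0.1 * t, 1.0, -0.5] for t in range(6)]
features = hybrid_stack(frames)
```

The design point this sketch illustrates is the trade-off the abstract describes: the recurrent layers carry context forward at linear cost, while the occasional attention layer lets each frame look back at all earlier frames directly.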

🔎 Similar Papers

EmoVOCA: Speech-Driven Emotional 3D Talking Heads