Efficient Listener: Dyadic Facial Motion Synthesis via Action Diffusion

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generating listener facial actions in dyadic dialogue is challenging: the action space is high-dimensional and long-term temporal dependencies must be captured. Moreover, existing approaches that model motion in 3D Morphable Model (3DMM) coefficient space make the 3DMM computation itself a bottleneck, hindering real-time responsiveness. This paper proposes Facial Action Diffusion (FAD), the first framework to bring diffusion models to listener facial action generation, sidestepping that bottleneck. FAD is paired with the Efficient Listener Network (ELNet), an end-to-end architecture that fuses the speaker's audio and visual signals. Compared with state-of-the-art methods, the approach achieves comparable or superior visual quality while cutting computational time by 99%, enabling low-latency, real-time listener facial animation.
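
To make the pipeline concrete, here is a minimal sketch of conditional diffusion sampling over facial action vectors. The paper does not publish code, so the denoiser interface, layer sizes, step count, and noise schedule below are illustrative assumptions, not the actual FAD/ELNet implementation.

```python
import torch

# Hypothetical denoiser standing in for ELNet; the paper releases no code,
# so the interface and dimensions here are assumptions for illustration.
class Denoiser(torch.nn.Module):
    def __init__(self, action_dim=64, cond_dim=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(action_dim + cond_dim + 1, 512),
            torch.nn.SiLU(),
            torch.nn.Linear(512, action_dim),
        )

    def forward(self, x_t, t, cond):
        # Predict the noise added at step t, conditioned on fused speaker features.
        t_emb = t.float().unsqueeze(-1) / 1000.0  # crude timestep embedding
        return self.net(torch.cat([x_t, cond, t_emb], dim=-1))

@torch.no_grad()
def sample_listener_actions(model, cond, action_dim=64, steps=1000):
    """Standard DDPM ancestral sampling over per-frame facial action vectors."""
    betas = torch.linspace(1e-4, 0.02, steps)   # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(cond.shape[0], action_dim)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        t_batch = torch.full((cond.shape[0],), t)
        eps = model(x, t_batch, cond)
        # DDPM posterior mean for x_{t-1}
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # generated listener facial actions for one frame/window

# Usage: cond = fused speaker audio/visual features, shape (batch, 256)
model = Denoiser()
actions = sample_listener_actions(model, torch.randn(2, 256))
```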

📝 Abstract
Generating realistic listener facial motions in dyadic conversations remains challenging due to the high-dimensional action space and long-term temporal dependency requirements. Existing approaches usually extract 3D Morphable Model (3DMM) coefficients and model motion in the 3DMM space. However, this makes the computational speed of the 3DMM a bottleneck, making real-time interactive responses difficult to achieve. To tackle this problem, we propose Facial Action Diffusion (FAD), which introduces diffusion methods from the field of image generation to achieve efficient facial action generation. We further build the Efficient Listener Network (ELNet), specially designed to accept both the visual and audio information of the speaker as input. Combining FAD and ELNet, the proposed method learns effective listener facial motion representations and improves performance over state-of-the-art methods while reducing computational time by 99%.
Problem

Research questions and friction points this paper is trying to address.

Generating realistic listener facial motions in conversations
Modeling the high-dimensional action space and long-term temporal dependencies
Reducing computational time for real-time interactive responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Facial Action Diffusion for motion synthesis
Integrates the Efficient Listener Network (ELNet) to fuse the speaker's audio and visual inputs (see the sketch after this list)
Reduces computational time by 99 percent
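
As a rough illustration of the multimodal input handling, the sketch below fuses the speaker's audio and visual feature streams into a single conditioning vector via cross-modal attention. The layer sizes, attention design, and temporal pooling are assumptions; the abstract does not specify ELNet's internals, so this is not the published architecture.

```python
import torch

class AudioVisualFusion(torch.nn.Module):
    """Illustrative speaker-feature fusion; NOT the published ELNet
    architecture, whose internals the abstract does not specify."""
    def __init__(self, audio_dim=128, visual_dim=128, out_dim=256):
        super().__init__()
        self.audio_proj = torch.nn.Linear(audio_dim, out_dim)
        self.visual_proj = torch.nn.Linear(visual_dim, out_dim)
        # Cross-modal attention: audio frames attend to visual frames.
        self.attn = torch.nn.MultiheadAttention(out_dim, num_heads=4, batch_first=True)
        self.out = torch.nn.Linear(out_dim, out_dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T, audio_dim); visual_feats: (B, T, visual_dim)
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        fused, _ = self.attn(query=a, key=v, value=v)
        return self.out(fused.mean(dim=1))  # pool over time -> (B, out_dim)

# Usage: 30 frames of paired speaker features -> one conditioning vector
fusion = AudioVisualFusion()
cond = fusion(torch.randn(2, 30, 128), torch.randn(2, 30, 128))
```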