ARIG: Autoregressive Interactive Head Generation for Real-time Conversations

📅 2025-07-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing methods for interactive head-motion generation in real-time dialogue suffer from reliance on future signals, insufficient contextual behavioral understanding, and discontinuous motion transitions. To address these issues, this paper proposes a frame-level autoregressive diffusion generative framework that models motion distributions directly in continuous action space—eliminating discrete codebooks. We introduce a novel bidirectional multimodal learning mechanism integrating speech and dialogue modalities, jointly leveraging voice activity detection and contextual state features to enable fine-grained dialogue state awareness and zero-latency responsiveness. Experiments demonstrate that our model significantly outperforms existing clip-wise and switching-based approaches: achieving ultra-low inference latency (<30 ms), improved motion naturalness (28.6% reduction in Fréchet Inception Distance), and enhanced interaction realism (32% increase in user-rated authenticity). The framework thus advances real-time, context-aware, and perceptually coherent conversational avatar animation.

Technology Category

Application Category

📝 Abstract
Face-to-face communication, as a common human activity, motivates the research on interactive head generation. A virtual agent can generate motion responses with both listening and speaking capabilities based on the audio or motion signals of the other user and itself. However, previous clip-wise generation paradigm or explicit listener/speaker generator-switching methods have limitations in future signal acquisition, contextual behavioral understanding, and switching smoothness, making it challenging to be real-time and realistic. In this paper, we propose an autoregressive (AR) based frame-wise framework called ARIG to realize the real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. Unlike discrete codebook-index prediction, we represent motion distribution using diffusion procedure, achieving more accurate predictions in continuous space. To improve interaction realism, we emphasize interactive behavior understanding (IBU) and detailed conversational state understanding (CSU). In IBU, based on dual-track dual-modal signals, we summarize short-range behaviors through bidirectional-integrated learning and perform contextual understanding over long ranges. In CSU, we use voice activity signals and context features of IBU to understand the various states (interruption, feedback, pause, etc.) that exist in actual conversations. These serve as conditions for the final progressive motion prediction. Extensive experiments have verified the effectiveness of our model.
Problem

Research questions and friction points this paper is trying to address.

Real-time interactive head generation for conversations
Improving motion prediction accuracy in continuous space
Enhancing interaction realism through behavior and state understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive frame-wise framework for real-time generation
Diffusion-based motion prediction in continuous space
Dual-track dual-modal signals for interactive behavior understanding
🔎 Similar Papers
No similar papers found.
Ying Guo
Ying Guo
Center for Agricultural Resources Research, Chinese Academy of Sciences
HydrologyWater resourcesEcohydrologyRemote Sensing
X
Xi Liu
Vision AI Department, Meituan
C
Cheng Zhen
Vision AI Department, Meituan
P
Pengfei Yan
Vision AI Department, Meituan
Xiaoming Wei
Xiaoming Wei
Meituan
computer visionmachine learning