Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work on gesture generation predominantly focuses on speaker-only motion synthesis, neglecting listener dynamic feedback and bidirectional speaker–listener interaction. This paper introduces the first speaker–listener co-generative diffusion framework, explicitly modeling the listener’s real-time full-body responses as conditional inputs to jointly synthesize temporally aligned, speech-driven full-body gestures for both participants. Our method integrates interactive condition augmentation, GAN-assisted large-step denoising, and multimodal temporal alignment. Quantitative evaluations demonstrate significant improvements over state-of-the-art methods across naturalness, motion coherence, and speech–gesture synchronization. Objective metrics—including FID, L1 joint velocity error, and sync error—show consistent gains; subjective user studies further confirm superior perceptual quality and interaction realism.

📝 Abstract
Full-body gestures play a pivotal role in natural interactions and are crucial for achieving effective communication. Nevertheless, most existing studies focus primarily on gesture generation for speakers, overlooking the vital role of listeners in the interaction process and failing to fully explore the dynamic interaction between them. This paper proposes an Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication. For the first time, we integrate the full-body gestures of listeners into the generation framework. By devising a novel inter-diffusion mechanism, the model can accurately capture the complex interaction patterns between speakers and listeners during communication. Building on a diffusion model architecture, we introduce interaction conditions and a GAN model to increase the denoising step size. As a result, when generating gesture sequences, the model not only generates dynamically from the speaker's speech information but also responds in real time to the listener's feedback, enabling synergistic interaction between the two. Extensive experimental results demonstrate that, compared with current state-of-the-art gesture generation methods, our model achieves remarkable improvements in the naturalness, coherence, and speech-gesture synchronization of the generated gestures. In subjective evaluations, users rated the generated interaction scenarios highly, finding them closer to real-life human communication. Objective evaluations also show that our model outperforms the baseline methods on multiple key indicators, providing stronger support for effective communication.
Problem

Research questions and friction points this paper is trying to address.

Existing methods synthesize speaker gestures only, ignoring the listener's dynamic feedback
Bidirectional speaker-listener interaction patterns are left unmodeled
Naturalness and speech-gesture synchronization suffer without interactive conditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates listener gestures into generation framework
Uses inter-diffusion mechanism for interaction patterns
Combines the diffusion model with a GAN to enable large denoising steps
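The mechanism sketched in these bullets can be illustrated with a toy sample loop. All names, shapes, and the placeholder generator below are our assumptions for illustration, not the authors' code: speaker and listener motion streams are denoised jointly, each step conditioning one stream on the current estimate of the other, while a generator network predicts the clean sample directly so the reverse process can take a few large strides instead of hundreds of small ones (the role the paper assigns to its GAN component).

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 32, 6  # frames x joint dims (toy sizes)

def generator(x_noisy, cond, t):
    """Stand-in for the GAN generator: predict the clean motion x0 from
    the noisy input and the partner's current motion estimate.
    A real model would also consume speech features and the timestep t."""
    return 0.9 * x_noisy + 0.1 * cond  # placeholder dynamics, not learned

def inter_diffusion_sample(steps=(999, 749, 499, 249)):
    # Both streams start from pure Gaussian noise.
    spk = rng.normal(size=(T, D))
    lst = rng.normal(size=(T, D))
    for t in steps:  # few, large denoising steps
        spk_x0 = generator(spk, lst, t)  # speaker conditioned on listener
        lst_x0 = generator(lst, spk, t)  # listener conditioned on speaker
        # Blend the current sample toward the predicted x0
        # (a simplified DDIM-like update with no re-noising term).
        w = t / 1000.0
        spk = w * spk + (1 - w) * spk_x0
        lst = w * lst + (1 - w) * lst_x0
    return spk, lst

spk, lst = inter_diffusion_sample()
print(spk.shape, lst.shape)
```

The key design point mirrored here is the symmetric cross-conditioning: each denoising step reads the other participant's current estimate, so the two gesture sequences co-evolve rather than being generated independently.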
Jinhe Huang
Nanjing Agricultural University

Yongkang Cheng
Mohamed bin Zayed University of Artificial Intelligence
Motion Capture · Motion Generation · Embodied AI

Gaoge Han
MBZUAI
Embodied AI · Digital Human · Computer Vision

Jinewei Li
University of Chinese Academy of Sciences

Jing Zhang
University of Chinese Academy of Sciences

Xingjian Gu
Nanjing Agricultural University