InterMamba: Efficient Human-Human Interaction Generation with Adaptive Spatio-Temporal Mamba

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Transformer-based methods for human-human interaction motion generation suffer from low efficiency in long-sequence modeling, excessive parameter counts, and poor real-time responsiveness. To address these limitations, this paper proposes the first adaptive spatio-temporal Mamba framework specifically designed for interactive motion generation. The method introduces a dual-branch state-space model (SSM) architecture and two novel Mamba modules, self-adaptive and cross-adaptive, to jointly and efficiently model individual motion dynamics and inter-personal dependencies. It further incorporates parallel spatio-temporal SSMs, adaptive gating, and cross-adaptive feature fusion to enhance expressiveness and efficiency. Evaluated on two standard interaction benchmarks, the approach achieves state-of-the-art performance with only 66M parameters (36% of InterGen's) and an inference latency of 0.57 seconds per sample, about 2.2x faster than InterGen, demonstrating a significant improvement in both generation quality and computational efficiency.

📝 Abstract
Human-human interaction generation has garnered significant attention in motion synthesis due to its vital role in understanding humans as social beings. However, existing methods typically rely on transformer-based architectures, which often face challenges related to scalability and efficiency. To address these issues, we propose a novel, efficient human-human interaction generation method based on the Mamba framework, designed to meet the demands of effectively capturing long-sequence dependencies while providing real-time feedback. Specifically, we introduce an adaptive spatio-temporal Mamba framework that utilizes two parallel SSM branches with an adaptive mechanism to integrate the spatial and temporal features of motion sequences. To further enhance the model's ability to capture dependencies within individual motion sequences and the interactions between different individual sequences, we develop two key modules: the self-adaptive spatio-temporal Mamba module and the cross-adaptive spatio-temporal Mamba module, enabling efficient feature learning. Extensive experiments demonstrate that our method achieves state-of-the-art results on two interaction datasets with remarkable quality and efficiency. Compared to the baseline method InterGen, our approach not only improves accuracy but also requires a minimal parameter size of just 66M, only 36% of InterGen's, while achieving an average inference time of 0.57 seconds per sample, which is 46% of InterGen's execution time.
Problem

Research questions and friction points this paper is trying to address.

Efficient human-human interaction generation overcoming scalability issues
Adaptive spatio-temporal Mamba framework for long-sequence dependencies
Real-time feedback with minimal parameters and faster inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Mamba framework for efficient interaction generation
Adaptive spatio-temporal SSM branches integrate features
Self-adaptive and cross-adaptive modules enhance learning
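The core idea behind these modules, two parallel SSM branches (one scanning over time, one over joints) whose outputs are fused by an adaptive gate, can be illustrated with a minimal NumPy sketch. Everything below (the diagonal linear scan, the shapes, and the sigmoid gate `w_gate`) is an illustrative stand-in, not the paper's actual parameterization; a real Mamba block uses input-dependent selective-scan parameters and learned projections.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssm_scan(x, a, b):
    # Simple diagonal linear state-space recurrence h_t = a*h_{t-1} + b*x_t,
    # scanned along the leading axis as a stand-in for a Mamba selective scan.
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def adaptive_st_block(x, a_t, b_t, a_s, b_s, w_gate):
    # x: (T, J, D) motion features (frames, joints, channels).
    # Temporal branch: scan over frames for each joint.
    h_time = ssm_scan(x, a_t, b_t)                                # (T, J, D)
    # Spatial branch: scan over joints within each frame.
    h_space = ssm_scan(x.transpose(1, 0, 2), a_s, b_s).transpose(1, 0, 2)
    # Adaptive gate computed from both branch outputs, then convex fusion.
    pre = np.concatenate([h_time, h_space], axis=-1) @ w_gate     # (T, J, D)
    gate = 1.0 / (1.0 + np.exp(-pre))
    return gate * h_time + (1.0 - gate) * h_space                 # (T, J, D)

T, J, D = 16, 22, 8
x = rng.standard_normal((T, J, D))
a_t, b_t = 0.9, 0.5      # temporal decay / input scale (illustrative)
a_s, b_s = 0.7, 0.5      # spatial SSM parameters (illustrative)
w_gate = rng.standard_normal((2 * D, D)) * 0.1
y = adaptive_st_block(x, a_t, b_t, a_s, b_s, w_gate)
```

Because the gate lies in (0, 1), each output channel is a convex blend of the temporal and spatial branch responses; a cross-adaptive variant would instead feed one person's features into the other's scan before gating.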
Zizhao Wu
Department of Digital Media Technology, Hangzhou Dianzi University, Hangzhou 310018, China
Yingying Sun
Department of Digital Media Technology, Hangzhou Dianzi University, Hangzhou 310018, China
Yiming Chen
Department of Digital Media Technology, Hangzhou Dianzi University, Hangzhou 310018, China
Xiaoling Gu
Department of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
Ruyu Liu
Marie Skłodowska-Curie Fellow at DTU
Jiazhou Chen
College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China