🤖 AI Summary
Existing 4D human-object interaction generation methods are largely confined to single-object manipulation and struggle to model natural bimanual coordination and simultaneous multi-object control. This work proposes Dex2HOI, a unified diffusion model for text-driven generation of single- and two-object interactions, including bimanual interactions with two objects. Dex2HOI employs a dual-stream diffusion architecture that models the interaction with each object in its own stream and coordinates the streams through bidirectional cross-attention. The approach combines hand-relative object representations, contact-aware conditioning applied across the whole sequence, and a motion fusion network, and it samples autoregressively over prefix-conditioned windows to synthesize interaction sequences of arbitrary length. Evaluated on both single- and two-object benchmarks, Dex2HOI reports state-of-the-art quantitative results and up to a 540× inference speedup over prior state-of-the-art methods, generating at real-time speed without test-time optimization.
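To make the idea of two interaction streams coordinated by bidirectional cross-attention concrete, here is a minimal sketch in plain Python. It is not the Dex2HOI layer: the summary does not specify the attention design, so the single head, the identity query/key/value projections, the residual update, and all function names below are assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention: each query token reads from the context tokens.

    Assumption: identity projections, so the context serves as both keys and values.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return outputs


def bidirectional_cross_attention(
    stream_a: List[Vector], stream_b: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """Update both streams from the same pre-update inputs, with a residual add."""
    a_reads_b = cross_attend(stream_a, stream_b)
    b_reads_a = cross_attend(stream_b, stream_a)
    new_a = [[x + y for x, y in zip(tok, upd)] for tok, upd in zip(stream_a, a_reads_b)]
    new_b = [[x + y for x, y in zip(tok, upd)] for tok, upd in zip(stream_b, b_reads_a)]
    return new_a, new_b


# Toy features: two tokens in the first object's stream, one in the second's.
stream_a = [[1.0, 0.0], [0.0, 1.0]]
stream_b = [[2.0, 3.0]]
new_a, new_b = bidirectional_cross_attention(stream_a, stream_b)
```

Computing both attention directions from the pre-update features keeps the exchange symmetric, so neither stream sees an already-updated version of the other within the same layer.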
📝 Abstract
Recent advances in 4D Human-Object Interaction (HOI) generation have enabled increasingly realistic motion synthesis, particularly for single-object manipulation. Yet current research overlooks an inherent property of human behavior: people naturally coordinate both hands and manipulate multiple objects simultaneously. To address this gap, we present Dex2HOI, a unified diffusion model for single- and two-object HOI synthesis from text. At its core, Dex2HOI employs a Dual-Stream Diffusion approach, where each object is processed in a dedicated interaction stream and the streams are coordinated through bidirectional cross-attention. To synthesize the final motion, we introduce a Motion Fusion Network integrated with novel hand-relative object representations and contact-aware conditioning applied across the whole sequence. By sampling the diffusion process autoregressively over prefix-conditioned windows, Dex2HOI generates arbitrarily long sequences at real-time speed without redundant test-time optimization, achieving up to a 540× inference speed-up over prior state-of-the-art methods. Extensive evaluation on both single- and two-object benchmarks demonstrates state-of-the-art quantitative results, marking a step beyond conventional single-object HOI generation and toward expressive multi-object manipulation. Code and models will be released upon acceptance.
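The autoregressive, prefix-conditioned window sampling described above can be sketched as a simple stitching loop. This is an illustration under stated assumptions, not the authors' sampler: the window and prefix sizes, the `sample_window` callback signature, and the `toy_sampler` that stands in for the diffusion model are all hypothetical.

```python
from typing import Callable, List

Frame = List[float]


def generate_long_sequence(
    sample_window: Callable[[List[Frame], int], List[Frame]],
    total_frames: int,
    window: int = 8,
    prefix: int = 2,
) -> List[Frame]:
    """Build a long sequence from fixed-size windows.

    Each window after the first is conditioned on the last `prefix` frames
    generated so far, and only the new frames are appended.
    """
    if not 0 <= prefix < window:
        raise ValueError("prefix must satisfy 0 <= prefix < window")
    frames: List[Frame] = []
    while len(frames) < total_frames:
        # The first window has no history; later windows reuse the tail as prefix.
        context = frames[-prefix:] if (frames and prefix) else []
        new_len = window - len(context)
        chunk = sample_window(context, new_len)
        if len(chunk) != new_len:
            raise ValueError("sampler returned an unexpected number of frames")
        frames.extend(chunk)
    return frames[:total_frames]


def toy_sampler(context: List[Frame], n_new: int) -> List[Frame]:
    """Hypothetical stand-in for a diffusion sampler: continues a 1-D ramp."""
    start = context[-1][0] if context else -1.0
    return [[start + i + 1.0] for i in range(n_new)]


motion = generate_long_sequence(toy_sampler, total_frames=20)
```

Because each window only needs the most recent frames as its prefix, the cost per window stays constant and the loop can run for any requested length. In the toy setup the stitched ramp is continuous across window boundaries, which is the property prefix conditioning is meant to provide.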