🤖 AI Summary
This work addresses the unnatural human intent inference and spatial configuration generation in human-robot object handover. We propose the first generative handover system integrating motor imagery cognitive modeling. Methodologically, we introduce cognitive imagery modeling—pre-handover mental simulation of motion—into robotic handover tasks for the first time; combine vision-language multimodal intent understanding with diffusion-model-driven spatial configuration synthesis; and realize end-to-end simulation from “intending to hand over” to “how to hand over” under robot kinematic constraints. We further design a real-time intent decoding framework supporting fused RGB-D and speech perception. Experiments in real-world human-robot interaction demonstrate 92% intent recognition accuracy and 87% user-rated naturalness—significantly outperforming baselines—while exhibiting high fluency, interpretability, and environmental adaptability.
📝 Abstract
We propose a novel system for robot-to-human object handover that emulates human coworker interactions. Unlike most existing studies that focus primarily on grasping strategies and motion planning, our system focus on 1. inferring human handover intents, 2. imagining spatial handover configuration. The first one integrates multimodal perception-combining visual and verbal cues-to infer human intent. The second one using a diffusion-based model to generate the handover configuration, involving the spacial relationship among robot's gripper, the object, and the human hand, thereby mimicking the cognitive process of motor imagery. Experimental results demonstrate that our approach effectively interprets human cues and achieves fluent, human-like handovers, offering a promising solution for collaborative robotics. Code, videos, and data are available at: https://i3handover.github.io.