🤖 AI Summary
This work addresses a core limitation of current robot learning systems: they rely predominantly on vision, perceive contact dynamics poorly, and therefore underperform in contact-intensive tasks. To overcome this, the authors propose a unified multimodal human-robot interaction framework built around a compact handheld sensing system that synchronously fuses RGB-D vision, tactile signals, internal grasping force, external interaction wrench, and trajectory data. A shared embodiment design keeps data collection consistent with deployment. By integrating impedance control with diffusion policies, the approach regulates motion and contact behavior through a single mechanism. The system supports dual-channel force feedback and natural perception of interaction forces, and it demonstrates robust multimodal sensing and strong downstream manipulation performance on force-sensitive pick-and-place, interactive surface erasing, and tactile-guided selective release.
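For intuition on "integrating impedance control with diffusion policies," here is a minimal Cartesian impedance sketch in Python. It is illustrative only, not the authors' implementation: the learned policy proposes a desired pose and velocity, and the controller turns the tracking error into a commanded wrench, so one law governs both motion and contact force. The gains and the 6-vector pose-error representation are assumptions.

```python
import numpy as np

# Illustrative Cartesian impedance law (not the paper's code): the policy's
# desired pose/velocity and the measured state are mapped to a commanded
# wrench, regulating motion and contact force with one mechanism.

K = np.diag([300.0, 300.0, 300.0, 30.0, 30.0, 30.0])  # stiffness (N/m, Nm/rad), assumed gains
D = np.diag([30.0, 30.0, 30.0, 3.0, 3.0, 3.0])        # damping, assumed gains

def impedance_wrench(x_des, x_cur, v_des, v_cur):
    """Commanded 6-DoF wrench: F = K (x_d - x) + D (v_d - v).

    x_*: 6-vectors (position + orientation error representation),
    v_*: 6-vector twists. A real controller computes orientation error
    on SO(3); poses are assumed pre-converted to 6-vectors for brevity.
    """
    return K @ (x_des - x_cur) + D @ (v_des - v_cur)
```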
📝 Abstract
UMI-style interfaces enable scalable robot learning, but existing systems remain largely visuomotor, relying primarily on RGB observations and trajectories while providing only limited access to physical interaction signals. This becomes a fundamental limitation in contact-rich manipulation, where success depends on contact dynamics, such as tactile interaction, internal grasping force, and external interaction wrench, that are difficult to infer from vision alone. We present OmniUMI, a unified framework for physically grounded robot learning via human-aligned multimodal interaction. OmniUMI synchronously captures RGB, depth, trajectory, tactile sensing, internal grasping force, and external interaction wrench within a compact handheld system, while maintaining collection–deployment consistency through a shared embodiment design. To support human-aligned demonstration, OmniUMI provides dual-channel force feedback via a bilateral gripper and natural perception of the external interaction wrench in the handheld embodiment. Built on this interface, we extend a diffusion policy with visual, tactile, and force-related observations, and deploy the learned policy through impedance-based execution for unified regulation of motion and contact behavior. Experiments demonstrate reliable sensing and strong downstream performance on force-sensitive pick-and-place, interactive surface erasing, and tactile-informed selective release. Overall, OmniUMI combines physically grounded multimodal data acquisition with human-aligned interaction, providing a scalable foundation for learning contact-rich manipulation.
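As a rough illustration of how the synchronized multimodal stream described in the abstract could condition a diffusion policy, the sketch below bundles one frame into policy inputs. All field names, shapes, and the split between image-like and low-dimensional inputs are assumptions for illustration, not the paper's actual interface.

```python
import numpy as np

def build_policy_observation(rgb, depth, tactile, grasp_force, ext_wrench, ee_pose):
    """Pack one synchronized multimodal frame into policy inputs (hypothetical layout).

    rgb:         (H, W, 3) uint8 camera image
    depth:       (H, W)    float32 depth map
    tactile:     (T,)      float32 taxel readings from the fingertips
    grasp_force: scalar    internal gripper force (N)
    ext_wrench:  (6,)      external force/torque at the tool (N, Nm)
    ee_pose:     (7,)      end-effector position + quaternion
    """
    # Concatenate the low-dimensional force and pose signals into one
    # conditioning vector; images and tactile arrays are left for encoders.
    low_dim = np.concatenate([
        np.atleast_1d(np.float32(grasp_force)),
        np.asarray(ext_wrench, dtype=np.float32),
        np.asarray(ee_pose, dtype=np.float32),
    ])
    return {
        "rgb": rgb,                                        # for a vision backbone
        "depth": depth,                                    # encoded separately or fused with RGB
        "tactile": np.asarray(tactile, dtype=np.float32),  # for a tactile encoder
        "low_dim": low_dim,                                # force/wrench/pose conditioning
    }
```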