OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction

📅 2026-04-12
🤖 AI Summary
This work addresses the limitations of current robot learning systems, which predominantly rely on vision and struggle to effectively perceive contact dynamics, thereby underperforming in contact-intensive tasks. To overcome this, the authors propose a unified multimodal human-robot interaction framework featuring a novel compact handheld sensing system that synchronously fuses RGB-D vision, tactile signals, internal and external force measurements, and trajectory data. A shared embodiment design ensures consistency between data collection and deployment. By integrating impedance control with diffusion policies, the approach enables unified modulation of motion and contact behaviors. The system supports dual-channel force feedback and natural interactive force perception, demonstrating robust multimodal sensing capabilities and superior downstream manipulation performance in tasks such as force-sensitive pick-and-place, interactive surface erasing, and tactile-guided selective release.

📝 Abstract
UMI-style interfaces enable scalable robot learning, but existing systems remain largely visuomotor, relying primarily on RGB observations and trajectory data while providing only limited access to physical interaction signals. This becomes a fundamental limitation in contact-rich manipulation, where success depends on contact dynamics, such as tactile interaction, internal grasping force, and external interaction wrench, that are difficult to infer from vision alone. We present OmniUMI, a unified framework for physically grounded robot learning via human-aligned multimodal interaction. OmniUMI synchronously captures RGB, depth, trajectory, tactile sensing, internal grasping force, and external interaction wrench within a compact handheld system, while maintaining collection–deployment consistency through a shared embodiment design. To support human-aligned demonstration, OmniUMI provides dual-channel force feedback through a bilateral gripper and natural perception of the external interaction wrench in the handheld embodiment. Built on this interface, we extend diffusion policy with visual, tactile, and force-related observations, and deploy the learned policy through impedance-based execution for unified regulation of motion and contact behavior. Experiments demonstrate reliable sensing and strong downstream performance on force-sensitive pick-and-place, interactive surface erasing, and tactile-informed selective release. Overall, OmniUMI combines physically grounded multimodal data acquisition with human-aligned interaction, providing a scalable foundation for learning contact-rich manipulation.
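The impedance-based execution described in the abstract can be sketched as a standard Cartesian impedance law, where the policy supplies a target pose and, optionally, a desired contact force. The function and gain names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def impedance_command(x_des, x, x_dot, f_ext, K, D, f_des=None):
    """Minimal Cartesian impedance law (illustrative sketch only).

    x_des, x : desired / measured end-effector position, shape (3,)
    x_dot    : measured end-effector velocity, shape (3,)
    f_ext    : measured external force (wrench, force part), shape (3,)
    K, D     : stiffness and damping gains (scalar or shape (3,))
    f_des    : optional desired contact force from the policy, shape (3,)
    """
    # spring-damper term regulates motion toward the policy's target pose
    f_cmd = K * (x_des - x) - D * x_dot
    if f_des is not None:
        # feedforward force error lets the same law regulate contact behavior
        f_cmd = f_cmd + (f_des - f_ext)
    return f_cmd

# Example: policy commands a pose 2 cm below the current one plus a
# 5 N downward contact force, while 1 N is currently measured.
x_des = np.array([0.4, 0.0, 0.10])
x     = np.array([0.4, 0.0, 0.12])
x_dot = np.zeros(3)
f_ext = np.array([0.0, 0.0, -1.0])
f_des = np.array([0.0, 0.0, -5.0])
cmd = impedance_command(x_des, x, x_dot, f_ext, K=500.0, D=40.0, f_des=f_des)
print(cmd)  # zero tangential force, -14 N along z
```

This is only a sketch of how a single control law can blend motion tracking and force regulation; the paper's actual controller structure and gains are not specified here.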
Problem

Research questions and friction points this paper is trying to address.

contact-rich manipulation
tactile sensing
force feedback
multimodal interaction
physically grounded learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal interaction
physically grounded learning
tactile sensing
force feedback
diffusion policy
Shaqi Luo
Beijing Academy of Artificial Intelligence (BAAI)
Embodied AI · Whole Body Control · Imitation Learning
Yuanyuan Li
Beijing Academy of Artificial Intelligence, MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences
Youhao Hu
Beijing Academy of Artificial Intelligence
Chenhao Yu
Beijing Academy of Artificial Intelligence, Beijing Institute of Technology
Chaoran Xu
Beijing Academy of Artificial Intelligence, Beijing University of Posts and Telecommunications
Jiachen Zhang
Beijing Academy of Artificial Intelligence, Beijing Institute of Technology
Guocai Yao
Beijing Academy of Artificial Intelligence
Tiejun Huang
Professor, School of Computer Science, Peking University
Visual Information Processing
Ran He
MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences
Zhongyuan Wang
BAAI
Knowledge Mining · Database · NLP · Text Understanding