TacUMI: A Multi-Modal Universal Manipulation Interface for Contact-Rich Tasks

📅 2026-01-21
🤖 AI Summary
This work addresses the challenge of accurately identifying semantic event boundaries in long-horizon manipulation tasks rich in physical contact, where reliance solely on visual and proprioceptive cues proves insufficient for effective task segmentation. To overcome this limitation, the authors propose TacUMI—a compact, multimodal data acquisition system that integrates ViTac visuo-tactile sensing with force-torque and pose perception—and, for the first time, embed it within a general-purpose manipulation interface to enable highly synchronized multimodal recording. Building upon this hardware foundation, they further introduce a temporal modeling–based multimodal fusion framework to automatically extract event boundaries from human demonstrations. Evaluated on a cable assembly task, the method achieves over 90% segmentation accuracy, significantly outperforming unimodal baselines and demonstrating the critical role of multimodal perception in enhancing task decomposition performance.

📝 Abstract
Task decomposition is critical for understanding and learning complex long-horizon manipulation tasks. Especially for tasks involving rich physical interaction, relying solely on visual observations and robot proprioceptive information often fails to reveal the underlying event transitions. This raises the need for efficient collection of high-quality multi-modal data as well as a robust segmentation method to decompose demonstrations into meaningful modules. Building on the handheld demonstration device Universal Manipulation Interface (UMI), we introduce TacUMI, a multi-modal data collection system that additionally integrates ViTac sensors, a force-torque sensor, and a pose tracker into a compact, robot-compatible gripper design, enabling synchronized acquisition of all these modalities during human demonstrations. We then propose a multi-modal segmentation framework that leverages temporal models to detect semantically meaningful event boundaries in sequential manipulations. Evaluation on a challenging cable mounting task shows more than 90% segmentation accuracy and a marked improvement as modalities are added, validating that TacUMI establishes a practical foundation for both scalable collection and segmentation of multi-modal demonstrations in contact-rich tasks.
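To make the pipeline concrete, below is a minimal sketch of the data flow the abstract describes: synchronized modality streams are fused per timestep, scored for temporal change, and thresholded into event boundaries. This is an illustrative stand-in only; the paper uses learned temporal models for boundary detection, whereas this sketch substitutes a simple before/after windowed-statistics heuristic, and all function names here are hypothetical.

```python
import numpy as np

def fuse_modalities(streams):
    """Concatenate per-timestep features from synchronized modality streams.

    Each stream is a (T, d_i) array sharing one timeline, as a synchronized
    recorder like TacUMI would provide (tactile, force-torque, pose, ...).
    """
    return np.concatenate(streams, axis=1)

def boundary_scores(fused, window=5):
    """Score each timestep by how much local statistics change around it.

    Compares the mean feature vector over `window` steps before t with the
    mean over `window` steps from t onward. A learned temporal model (as in
    the paper) would replace this hand-crafted heuristic.
    """
    T = fused.shape[0]
    scores = np.zeros(T)
    for t in range(window, T - window):
        before = fused[t - window:t].mean(axis=0)
        after = fused[t:t + window].mean(axis=0)
        scores[t] = np.linalg.norm(after - before)
    return scores

def detect_boundaries(scores, thresh):
    """Keep local maxima of the score that exceed a threshold."""
    return [t for t in range(1, len(scores) - 1)
            if scores[t] > thresh
            and scores[t] >= scores[t - 1]
            and scores[t] >= scores[t + 1]]
```

For example, a tactile stream with an abrupt contact event at one timestep, fused with a quiescent force stream, yields a single boundary at that timestep. The point of the sketch is the structure, fusing synchronized modalities before temporal boundary scoring, which is where the abstract attributes the accuracy gains from added modalities.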
Problem

Research questions and friction points this paper is trying to address.

task decomposition
contact-rich tasks
multi-modal data
event segmentation
manipulation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal sensing
tactile perception
task segmentation
contact-rich manipulation
human demonstration
Tailai Cheng
School of Computation, Information and Technology, Technical University of Munich, Germany

Kejia Chen
Technical University of Munich
Manipulation of Deformable Objects · Multi-robot Collaboration · LLM-based Planning

Lingyun Chen
Munich Institute of Robotics and Machine Intelligence, TUM

Liding Zhang
School of Computation, Information and Technology, Technical University of Munich, Germany

Yue Zhang
School of Computation, Information and Technology, Technical University of Munich, Germany

Yao Ling
School of Computation, Information and Technology, Technical University of Munich, Germany

Mahdi Hamad
Agile Robots SE, Munich, Germany

Zhenshan Bing
Nanjing University / Technical University of Munich
Robotics

Fan Wu
Professor, Department of Computer Science and Engineering, Shanghai Jiao Tong University
Wireless Networking · Mobile Computing · Algorithmic Game Theory and Its Applications

Karan Sharma
AI @ Agile Robots SE
Applied Machine Learning · Robotics · Emotion AI · Affective Computing

Alois Knoll
Technische Universität München
Robotics · AI · Sensor Data Fusion · Autonomous Driving · Cyber Physical Systems