VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

📅 2026-06-03

🤖 AI Summary
This work addresses two key challenges in training Vision-Language-Action (VLA) models on Universal Manipulation Interface (UMI) data: wrist-mounted fisheye observations are out-of-distribution for pre-trained vision-language models, and human-collected trajectories often contain physically infeasible actions. To bridge the visual gap, the authors introduce UMI-VQA, which they describe as the first large-scale visual question answering dataset tailored to wrist-mounted fisheye views. They further propose a physical-validation pipeline that runs a data-completeness pre-check and then scores each trajectory for continuity, self-collision risk, and execution fidelity before it enters training. A two-stage co-training recipe then jointly learns vision-language grounding on UMI-VQA and action prediction on the validated trajectories. The authors report that the approach significantly outperforms baselines including π₀.₅, LingBot-VLA, and Wall-X on simulation and real-world tasks, and that physical-validation scores are strongly predictive of deployment success.
📝 Abstract
Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, teaching VLA policies physically infeasible actions. To address these challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime via auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments show that incorporating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including π₀.₅, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.
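
To make component (ii) more concrete, the following is a minimal sketch of what a data-completeness pre-check and a trajectory-continuity score could look like. It is not the authors' released pipeline: the function names `is_complete` and `continuity_score`, the finite-difference formulation, and the limits `v_max` and `a_max` are all assumptions chosen for illustration, and self-collision and execution-fidelity scoring are left out entirely.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def is_complete(timestamps: Sequence[float], positions: Sequence[Vec]) -> bool:
    """Hypothetical pre-check: aligned lengths, finite values, increasing time."""
    if len(timestamps) != len(positions) or len(timestamps) < 3:
        return False
    if not all(math.isfinite(t) for t in timestamps):
        return False
    if not all(math.isfinite(c) for p in positions for c in p):
        return False
    # Timestamps must be strictly increasing so finite differences are defined.
    return all(t1 > t0 for t0, t1 in zip(timestamps, timestamps[1:]))


def _norm(v: Vec) -> float:
    return math.sqrt(sum(c * c for c in v))


def continuity_score(
    timestamps: Sequence[float],
    positions: Sequence[Vec],
    v_max: float,
    a_max: float,
) -> float:
    """Fraction of finite-difference checks that stay within assumed limits.

    Assumes the trajectory already passed is_complete(). v_max and a_max are
    illustrative speed and acceleration bounds, not values from the paper.
    """
    # Per-step velocity vectors from consecutive positions.
    velocities: List[List[float]] = []
    for i in range(len(positions) - 1):
        dt = timestamps[i + 1] - timestamps[i]
        velocities.append(
            [(b - a) / dt for a, b in zip(positions[i], positions[i + 1])]
        )

    # One check per velocity sample, then one per acceleration sample.
    checks = [_norm(v) <= v_max for v in velocities]
    for i in range(len(velocities) - 1):
        # Time between the midpoints of the two neighbouring steps.
        dt = (timestamps[i + 2] - timestamps[i]) / 2.0
        accel = [(b - a) / dt for a, b in zip(velocities[i], velocities[i + 1])]
        checks.append(_norm(accel) <= a_max)

    return sum(checks) / len(checks)
```

Under these assumptions, a trajectory would first be gated by `is_complete` and then scored; a score of 1.0 means every sampled speed and acceleration stayed within the chosen limits, and a sudden jump in gripper position lowers the score because it fails both the speed check for that step and the neighbouring acceleration checks.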
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
UMI data
fisheye distortion
physical feasibility
robot manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)
fisheye distortion alignment
physical validation
UMI-VQA dataset
co-training framework
Authors

Siyuan Yang
Wallenberg-NTU Presidential Postdoctoral Fellowship, Nanyang Technological University
Computer Vision, Action Recognition

Linzheng Guo
Institute of AI (TeleAI), China Telecom; Northwestern Polytechnical University

Ouyang Lu
Institute of AI (TeleAI), China Telecom; Northwestern Polytechnical University

Zhaxizhuoma
Shanghai Jiao Tong University

Daoran Zhang
Institute of AI (TeleAI), China Telecom; East China University of Science and Technology

Xinmiao Wang
Institute of AI (TeleAI), China Telecom; Harbin Institute of Technology

Ting Xiao
East China University of Science and Technology
Medical Image Analysis, Few-shot Learning, Reinforcement Learning

Fangzheng Yan
Institute of AI (TeleAI), China Telecom

Zhijun Chen
Beihang University
Machine Learning, Natural Language Processing

Yan Ding
Lumos Robotics; Fudan University

Chao Yu
Lumos Robotics

Chenjia Bai
Institute of Artificial Intelligence (TeleAI), China Telecom
Reinforcement Learning, Robotics, Embodied AI

Xuelong Li
Institute of AI (TeleAI), China Telecom