VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

📅 2026-06-03

🤖 AI Summary

This work addresses two challenges in training Vision-Language-Action (VLA) models on Universal Manipulation Interface (UMI) data: wrist-mounted fisheye observations are out-of-distribution for pre-trained vision-language models, and human-collected trajectories often contain physically infeasible actions. To close the visual gap, the authors introduce UMI-VQA, which they describe as the first large-scale visual question answering dataset tailored to wrist-mounted fisheye views. To address infeasible actions, they propose a physical-validation pipeline that runs a data-completeness pre-check and then scores each valid trajectory for continuity, self-collision risk, and execution fidelity before it enters training. A two-stage co-training recipe then jointly learns vision-language grounding on UMI-VQA and action prediction on the validated trajectories. The authors report that VISTA significantly outperforms baselines including π₀.₅, LingBot-VLA, and Wall-X on both simulation and real-world tasks, and that physical-validation scores are strongly predictive of deployment success.
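To give a rough sense of why fisheye views differ from the perspective images that pre-trained VLMs mostly see, the sketch below compares an ideal pinhole projection (r = f·tan θ) with an equidistant fisheye projection (r = f·θ). The equidistant model is an assumption made for illustration; the summary does not state which lens model the UMI camera follows, and the function names are hypothetical.

```python
import math


def pinhole_radius(theta: float, f: float) -> float:
    """Image radius of a ray at angle theta (radians) under an ideal pinhole model."""
    return f * math.tan(theta)


def equidistant_fisheye_radius(theta: float, f: float) -> float:
    """Image radius under the equidistant fisheye model (assumed, not from the paper)."""
    return f * theta


def radial_compression(theta: float) -> float:
    """Ratio of fisheye radius to pinhole radius for 0 <= theta < pi/2.

    The ratio is 1.0 at the image center and shrinks toward the edge,
    which is the radial distortion a perspective-trained model has not seen.
    """
    if theta == 0.0:
        return 1.0
    return theta / math.tan(theta)


if __name__ == "__main__":
    for deg in (0, 20, 40, 60, 80):
        ratio = radial_compression(math.radians(deg))
        print(f"{deg:2d} deg off-axis: fisheye/pinhole radius ratio = {ratio:.3f}")
```

Under this assumed model, content near the optical axis looks almost perspective-like, while content toward the image edge is increasingly compressed.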

📝 Abstract

Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, teaching VLA policies physically infeasible actions. To address these challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime via auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments show that incorporating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including π₀.₅, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.
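The abstract does not spell out how trajectory continuity is scored, so the following is only a minimal sketch of one plausible check, not the authors' pipeline. It assumes end-effector positions sampled at a fixed interval and flags finite-difference speeds and accelerations that exceed limits; `continuity_score`, `max_speed`, and `max_accel` are hypothetical names and thresholds.

```python
import math


def _finite_diff(samples, dt):
    """First-order finite difference between consecutive vector samples."""
    return [
        tuple((b - a) / dt for a, b in zip(prev, nxt))
        for prev, nxt in zip(samples, samples[1:])
    ]


def _norm(vec):
    """Euclidean norm of a vector given as a tuple."""
    return math.sqrt(sum(c * c for c in vec))


def continuity_score(positions, dt, max_speed, max_accel):
    """Fraction of speed and acceleration samples within the assumed limits.

    positions: list of (x, y, z) tuples in meters, sampled every dt seconds.
    Returns a value in [0, 1]; 1.0 means no sample exceeded a limit.
    """
    if len(positions) < 3:
        raise ValueError("need at least three samples to estimate acceleration")
    velocities = _finite_diff(positions, dt)
    accelerations = _finite_diff(velocities, dt)
    checks = [_norm(v) <= max_speed for v in velocities]
    checks += [_norm(a) <= max_accel for a in accelerations]
    return sum(checks) / len(checks)


if __name__ == "__main__":
    smooth = [(0.01 * i, 0.0, 0.0) for i in range(10)]
    # Same path with a 0.5 m jump halfway, e.g. a tracking glitch.
    jumpy = [(x + (0.5 if i >= 5 else 0.0), y, z) for i, (x, y, z) in enumerate(smooth)]
    print("smooth:", continuity_score(smooth, dt=0.1, max_speed=1.0, max_accel=5.0))
    print("jumpy: ", continuity_score(jumpy, dt=0.1, max_speed=1.0, max_accel=5.0))
```

A score like this could be thresholded to drop trajectories or combined with other terms; the self-collision and execution-fidelity scores the abstract mentions would need a robot model and a controller, which are outside this sketch.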

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

UMI data

fisheye distortion

physical feasibility

robot manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)

fisheye distortion alignment

physical validation

UMI-VQA dataset

co-training framework
