Learning to Act Robustly with View-Invariant Latent Actions

📅 2026-01-06

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-based robotic policies generalize poorly under viewpoint changes, largely because they lack viewpoint-invariant representations that capture physical dynamics. To address this, the authors propose View-Invariant Latent Action (VILA), a framework that learns latent actions modeling transition patterns across trajectories and uses ground-truth action sequences to guide cross-view alignment, yielding view-invariant representations grounded in physical dynamics. By coupling viewpoint invariance with physical dynamics, VILA makes policies more robust to unseen viewpoints and improves transfer to new tasks. Experiments in both simulation and the real world support these gains in cross-view generalization and task transfer.

📝 Abstract

Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization. We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves robustness and downstream learning performance.
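
To make the idea of an action-guided alignment objective more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the loss, so everything below is an assumption chosen for illustration. In particular, the function names `latent_action` and `action_guided_alignment_loss`, the use of an embedding difference as the latent action, the cosine-similarity logits, and the `temperature` parameter are all hypothetical. The sketch treats latent actions from two camera views as matching when their ground-truth actions are close, and penalizes disagreement with a soft cross-entropy.

```python
import numpy as np


def _log_softmax(x):
    """Row-wise, numerically stable log-softmax."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))


def latent_action(z_t, z_next):
    """Hypothetical latent action: difference of consecutive frame embeddings.

    Assumption for illustration only; the paper may use a learned encoder.
    """
    return z_next - z_t


def action_guided_alignment_loss(lat_a, lat_b, actions, temperature=0.1):
    """Illustrative cross-view alignment loss (an assumption, not the paper's).

    lat_a, lat_b: (N, D) latent actions for the same N transitions, seen
                  from two different viewpoints.
    actions:      (N, A) ground-truth actions for those transitions.
    """
    # Cosine similarity between every view-A and view-B latent action.
    a = lat_a / (np.linalg.norm(lat_a, axis=1, keepdims=True) + 1e-8)
    b = lat_b / (np.linalg.norm(lat_b, axis=1, keepdims=True) + 1e-8)
    logits = (a @ b.T) / temperature

    # Soft targets: transitions with similar ground-truth actions should
    # also have similar latent actions, regardless of viewpoint.
    dists = np.linalg.norm(actions[:, None, :] - actions[None, :, :], axis=-1)
    targets = np.exp(_log_softmax(-dists / temperature))

    # Cross-entropy between action-derived targets and latent similarities.
    return float(-(targets * _log_softmax(logits)).sum(axis=1).mean())
```

Under these assumptions, the loss is small when the two views assign similar latent actions to the same transition and grows when the pairing between views is scrambled, which is the behavior a view-invariant latent action space would need.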

Problem

Research questions and friction points this paper is trying to address.

view-invariance

vision-based robotic policies

physical dynamics

viewpoint variability

robust generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

view-invariant representation

latent action

physical dynamics