VLA-LPAF: Lightweight Perspective-Adaptive Fusion for Vision-Language-Action to Enable More Unconstrained Robotic Manipulation

📅 2025-09-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language-action (VLA) models suffer from feature distribution shifts induced by heterogeneous visual viewpoints (e.g., egocentric wrist vs. global camera views), severely limiting cross-environment generalization. To address this, we propose the Lightweight Perspective-Adaptive Fusion (LPAF) module, which aligns and fuses multi-view features directly in the 2D latent space—without requiring costly 3D reconstruction. Integrated into the RoboFlamingo architecture, our approach combines single-view fine-tuning, multi-view latent-space fusion, and contrastive-driven feature alignment to learn viewpoint-invariant representations. Evaluated on CALVIN, LIBERO, and a custom simulation benchmark, LPAF improves task success rates by 8%, 15%, and 30%, respectively, and demonstrates robust viewpoint adaptation on real robotic hardware. Our key contribution is the first purely 2D perspective-adaptive fusion mechanism for VLAs—achieving substantial gains in open-world generalization with minimal computational overhead.

📝 Abstract
Vision-Language-Action (VLA) models can follow text instructions according to visual observations of the surrounding environment. This ability to map multimodal inputs to actions is derived from training VLA models on extensive standard demonstrations. The visual observations, captured by third-person global cameras and in-wrist local cameras, inevitably vary in number and perspective across environments, resulting in significant differences in visual features. This perspective heterogeneity constrains the generality of VLA models. In light of this, we propose the lightweight module VLA-LPAF to foster the perspective adaptivity of VLA models using only 2D data. VLA-LPAF is fine-tuned on images from a single view and fuses multiview observations in the latent space, which effectively and efficiently bridges the gap caused by perspective inconsistency. We instantiate the VLA-LPAF framework with the VLA model RoboFlamingo to construct RoboFlamingo-LPAF. Experiments show that RoboFlamingo-LPAF achieves average task success rate improvements of around 8% on CALVIN, 15% on LIBERO, and 30% on a customized simulation benchmark. We also demonstrate the view-adaptive characteristics of the proposed RoboFlamingo-LPAF through real-world tasks.
Problem

Research questions and friction points this paper is trying to address.

Addresses perspective heterogeneity in visual observations from different cameras
Bridges the gap caused by perspective inconsistency in VLA models
Enables more unconstrained robotic manipulation across varied environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight module for perspective adaptivity
Fuses multiview observations in latent space
Fine-tuned using single-view images
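The fusion-and-alignment idea in these bullets can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration, not the paper's implementation: the function names, feature shapes, attention-weighted pooling, and the InfoNCE-style contrastive loss are all assumptions standing in for the (unpublished here) LPAF details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(view_feats, query):
    """Attention-weighted fusion of per-view features in a 2D latent space.

    view_feats: (V, D) array, one feature vector per camera view
    query:      (D,) anchor feature, e.g. from the fine-tuned single view

    Hypothetical stand-in for LPAF's latent-space multiview fusion.
    """
    scores = view_feats @ query / np.sqrt(view_feats.shape[1])
    weights = softmax(scores)            # (V,) attention over views
    return weights @ view_feats          # (D,) fused representation

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss pulling matched wrist/global view pairs together.

    anchors, positives: (N, D) arrays of paired view features.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature       # (N, N) cosine-similarity logits
    # Matched pairs sit on the diagonal; maximize their log-probability.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return -log_probs[idx, idx].mean()
```

Under this reading, training would minimize the contrastive term so that features from different viewpoints of the same scene land close together, making the fused representation viewpoint-invariant.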
Jinyue Bian
Li Auto Inc., Beijing, China
Zhaoxing Zhang
Huazhong University of Science and Technology
Visual Odometry · Robotics · Exploration
Zhengyu Liang
Li Auto Inc., Beijing, China
Shiwei Zheng
Li Auto Inc., Beijing, China
Shengtao Zhang
Li Auto Inc., Beijing, China
Rong Shen
Li Auto Inc., Beijing, China
Chen Yang
Li Auto Inc., Beijing, China
Anzhou Hou
Li Auto Inc., Beijing, China