Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of vision-language-action (VLA) models to sensor-level image corruptions (such as electronic noise, dead pixels, and lens smudges) in real-world environments, which can severely degrade their performance. The study presents the first systematic investigation into the impact of such corruptions on VLA models and introduces the Corruption Restoration Transformer (CRT), a plug-and-play, model-agnostic vision-transformer restoration module. Trained with an adversarial objective, CRT recovers clean visual signals end-to-end from corrupted inputs without requiring fine-tuning of the underlying VLA model. Experimental results on the LIBERO and Meta-World benchmarks demonstrate that CRT effectively restores task success rates under severe corruption, from as low as 2% back to nearly the original 90%, significantly enhancing the robustness of VLA systems in practical deployment scenarios.

📝 Abstract
Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation, unifying perception and control within a single end-to-end architecture. However, despite their success in controlled environments, reliable real-world deployment is severely hindered by their fragility to visual disturbances. While existing literature extensively addresses physical occlusions caused by scene geometry, a critical failure mode remains largely unexplored: image corruptions. These sensor-level artifacts, ranging from electronic noise and dead pixels to lens contaminants, directly compromise the integrity of the visual signal prior to interpretation. In this work, we quantify this vulnerability, demonstrating that state-of-the-art VLAs such as $\pi_{0.5}$ and SmolVLA suffer catastrophic performance degradation, dropping from 90\% success rates to as low as 2\%, under common signal artifacts. To mitigate this, we introduce the Corruption Restoration Transformer (CRT), a plug-and-play, model-agnostic vision transformer designed to immunize VLA models against sensor disturbances. Leveraging an adversarial training objective, CRT restores clean observations from corrupted inputs without requiring computationally expensive fine-tuning of the underlying model. Extensive experiments across the LIBERO and Meta-World benchmarks demonstrate that CRT effectively recovers lost performance, enabling VLAs to maintain near-baseline success rates, even under severe visual corruption.
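The plug-and-play idea can be sketched as follows. This is an illustrative toy, not the paper's architecture: a simple 3x3 mean filter stands in for CRT's restoration transformer, `corrupt` simulates electronic sensor noise, and `vla_policy` is a hypothetical placeholder for a frozen downstream VLA. The point is only the composition: restoration sits between the sensor and the policy, and the policy itself is never fine-tuned.

```python
import numpy as np

def corrupt(img, rng, sigma=0.3):
    """Simulate a sensor-level corruption: additive Gaussian (electronic) noise."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def restore(img, k=3):
    """Stand-in for CRT: a naive k x k mean filter (illustration only)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def vla_policy(img):
    """Hypothetical frozen VLA: maps an observation to a toy scalar 'action'."""
    return img.mean()

rng = np.random.default_rng(0)
clean = np.full((16, 16), 0.5)          # idealized clean observation
noisy = corrupt(clean, rng)             # what the sensor actually delivers

# Plug-and-play composition: restoration is prepended; the policy is untouched.
action = vla_policy(restore(noisy))
```

In the actual system the restoration module is a vision transformer trained adversarially against corruptions, but the interface is the same: it only transforms observations, so it can wrap any VLA model without modifying its weights.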
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
visual disturbances
image corruptions
sensor artifacts
robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action models
image corruption
robustness
Corruption Restoration Transformer
adversarial training
Daniel Yezid Guarnizo Orjuela
Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Milano
Leonardo Scappatura
Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Milano
Veronica Di Gennaro
Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Milano
Riccardo Andrea Izzo
Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Milano
Gianluca Bardaro
Research Fellow, AIRLab, Politecnico di Milano
Robotics
Matteo Matteucci
Full Professor, Department of Electronics Information and Bioengineering, Politecnico di Milano
Robotics, Machine Learning, Computer Vision, Pattern Recognition