🤖 AI Summary
This study investigates the impact of visual clutter on the robotic manipulation performance of vision-language-action (VLA) models. Addressing the lack of quantifiable clutter modeling and systematic evaluation in prior work, we introduce the first psychophysics-inspired clutter metric, integrating distractor count, occlusion severity, and spatial distribution characteristics. We benchmark mainstream VLA models in a unified manner across photorealistic simulation and physical robot platforms, revealing divergent vulnerability patterns and a consistent decline in task success under clutter. We further demonstrate that the degree of clutter significantly predicts performance degradation, which reaches up to 34%, and show that standard fine-tuning fails to uniformly mitigate the diverse effects of clutter. Our core contribution is the first interpretable, reproducible visual clutter assessment framework for VLA models, providing both theoretical grounding and empirical benchmarks to guide the design of robust multimodal robotic systems.
📝 Abstract
In this work, we propose an evaluation protocol for examining the performance of robotic manipulation policies in cluttered scenes. In contrast to prior works, we approach evaluation from a psychophysical perspective: we use a unified measure of clutter that accounts for environmental factors as well as the quantity, characteristics, and arrangement of distractors. Using this measure, we systematically construct evaluation scenarios in both hyper-realistic simulation and the real world, and conduct extensive experiments on manipulation policies, in particular vision-language-action (VLA) models. Our experiments highlight the significant impact of scene clutter, which lowers policy performance by as much as 34%, and show that, despite achieving similar average performance across tasks, different VLA policies have unique vulnerabilities and relatively low agreement on success scenarios. We further show that our clutter measure is an effective indicator of performance degradation, and we analyze the impact of distractors in terms of their quantity and occluding influence. Finally, we show that fine-tuning on enhanced data, although effective, does not equally remedy all negative impacts of clutter on performance.
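To make the idea of a unified clutter measure concrete, the sketch below shows one way the factors named above (distractor count, occlusion severity, and spatial arrangement) could be combined into a single score. This is a purely illustrative example: the `Scene` container, the weights, and the normalization constant are hypothetical and are not the metric defined in the paper.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    """Hypothetical per-scene features feeding a clutter score."""
    num_distractors: int      # number of distractor objects in the scene
    occlusion_ratio: float    # fraction of the target occluded, in [0, 1]
    spatial_dispersion: float # normalized spread of distractor positions, in [0, 1]


def clutter_score(scene: Scene,
                  w_count: float = 0.4,
                  w_occ: float = 0.4,
                  w_disp: float = 0.2,
                  max_distractors: int = 20) -> float:
    """Illustrative weighted clutter measure in [0, 1].

    Weights and the distractor-count cap are arbitrary placeholders,
    not values from the paper.
    """
    count_term = min(scene.num_distractors / max_distractors, 1.0)
    return (w_count * count_term
            + w_occ * scene.occlusion_ratio
            + w_disp * scene.spatial_dispersion)
```

Under such a scheme, a scene with 10 distractors, half the target occluded, and moderately spread distractors would score `clutter_score(Scene(10, 0.5, 0.5)) == 0.5`, and scores could be binned to construct evaluation scenarios of increasing difficulty.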