🤖 AI Summary
Vision-language models (VLMs) exhibit uncharacterized generalization limitations when deployed in embodied-intelligence settings involving novel physical objects, particularly those with unfamiliar geometry or material properties, such as 3D-printed artifacts.
Method: We conduct the first systematic evaluation of VLM generalization to real versus 3D-printed objects under single- and multi-view RGB inputs, benchmarking mainstream models (e.g., BLIP) on scene-description tasks using a curated paired dataset captured with a robot-arm-mounted RGB camera. Performance is quantitatively assessed in terms of object-identification accuracy and the completeness and naturalness of the generated descriptions.
Contribution/Results: The models perform strongly on common real-world objects but degrade significantly on 3D-printed counterparts, highlighting VLMs' reliance on training-domain physical priors and their limited out-of-distribution physical representation learning. While multi-view inputs improve robustness, they do not resolve the fundamental generalization bottleneck. This work establishes a benchmark for deploying VLMs in robot perception–understanding loops and identifies concrete directions for improving physical-world generalization.
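For context, the sketch below illustrates the kind of captioning step evaluated in the study: generating a description for each RGB view collected by the robot arm. The BLIP checkpoint name, image paths, and generation settings are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal single- and multi-view captioning sketch with BLIP
# (Hugging Face transformers). Checkpoint, file names, and generation
# settings are illustrative assumptions, not the paper's actual setup.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(path: str) -> str:
    """Generate one caption for a single RGB view of the tabletop scene."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Multi-view variant: caption every viewpoint collected by the robot arm
# and keep all captions for downstream aggregation or evaluation.
views = ["view_front.jpg", "view_left.jpg", "view_top.jpg"]  # hypothetical filenames
captions = [caption_image(v) for v in views]
print(captions)
```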
📝 Abstract
Robotic scene understanding increasingly relies on vision-language models (VLMs) to generate natural language descriptions of the environment. In this work, we present a comparative study of captioning strategies for tabletop scenes captured by a robotic arm equipped with an RGB camera. The robot collects images of objects from multiple viewpoints, and we evaluate several models that generate scene descriptions, comparing dedicated captioning models such as BLIP with general-purpose VLMs. Our experiments examine the trade-offs between single-view and multi-view captioning, as well as the difference between recognising real-world and 3D-printed objects. We quantitatively evaluate the object-identification accuracy, completeness, and naturalness of the generated captions. Results show that VLMs can be used in robotic settings where common objects need to be recognised, but they fail to generalise to novel representations. Our findings provide practical insights into deploying foundation models for embodied agents in real-world settings.
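As a rough illustration of the object-identification accuracy metric, one possible scoring rule is to check how many ground-truth object names appear in a generated caption. The string-matching rule and example labels below are assumptions for illustration, not the paper's actual evaluation protocol.

```python
# Toy sketch of object-identification accuracy: the fraction of ground-truth
# objects mentioned in a generated caption. Matching rule and labels are
# illustrative assumptions, not the paper's evaluation protocol.
def object_identification_accuracy(caption: str, gt_objects: list[str]) -> float:
    caption_lower = caption.lower()
    hits = sum(1 for obj in gt_objects if obj.lower() in caption_lower)
    return hits / len(gt_objects) if gt_objects else 0.0

# Hypothetical example: a real mug is recognised, a 3D-printed counterpart is not.
print(object_identification_accuracy("a mug and a banana on a table", ["mug", "banana"]))  # 1.0
print(object_identification_accuracy("a white plastic shape on a table", ["mug"]))         # 0.0
```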