🤖 AI Summary
Vision-language models (VLMs) exhibit uncharacterized generalization limitations when deployed in embodied-intelligence settings involving novel physical objects, particularly those with unfamiliar geometry or material properties, such as 3D-printed artifacts.
Method: We conduct the first systematic evaluation of VLM generalization to real versus 3D-printed objects under single- and multi-view RGB inputs, benchmarking mainstream models (e.g., BLIP) on scene-description tasks using a curated paired dataset captured with a robot-arm-mounted RGB camera. Performance is quantitatively assessed in terms of object-identification accuracy and the completeness and naturalness of the generated descriptions.
Contribution/Results: The models perform strongly on common real-world objects but degrade significantly on 3D-printed counterparts, highlighting VLMs' reliance on training-domain physical priors and their limited out-of-distribution physical representation learning. While multi-view inputs improve robustness, they do not resolve the fundamental generalization bottleneck. This work establishes a benchmark for deploying VLMs in robot perception–understanding loops and identifies concrete directions for improving physical-world generalization.
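For context, the sketch below illustrates the kind of captioning step evaluated in the study: generating a description for each RGB view collected by the robot arm. The BLIP checkpoint name, image paths, and generation settings are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal single- and multi-view captioning sketch with BLIP
# (Hugging Face transformers). Checkpoint, file names, and generation
# settings are illustrative assumptions, not the paper's actual setup.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(path: str) -> str:
    """Generate one caption for a single RGB view of the tabletop scene."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Multi-view variant: caption every viewpoint collected by the robot arm
# and keep all captions for downstream aggregation or evaluation.
views = ["view_front.jpg", "view_left.jpg", "view_top.jpg"]  # hypothetical filenames
captions = [caption_image(v) for v in views]
print(captions)
```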
📝 Abstract
Robotic scene understanding increasingly relies on vision-language models (VLMs) to generate natural language descriptions of the environment. In this work, we present a comparative study of captioning strategies for tabletop scenes captured by a robotic arm equipped with an RGB camera. The robot collects images of objects from multiple viewpoints, and we evaluate several models that generate scene descriptions, comparing dedicated captioning models such as BLIP with general-purpose VLMs. Our experiments examine the trade-offs between single-view and multi-view captioning, as well as the difference between recognising real-world and 3D-printed objects. We quantitatively evaluate the object-identification accuracy, completeness, and naturalness of the generated captions. Results show that VLMs can be used in robotic settings where common objects need to be recognised, but they fail to generalise to novel representations. Our findings provide practical insights into deploying foundation models for embodied agents in real-world settings.
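As a rough illustration of the object-identification accuracy metric, one possible scoring rule is to check how many ground-truth object names appear in a generated caption. The string-matching rule and example labels below are assumptions for illustration, not the paper's actual evaluation protocol.

```python
# Toy sketch of object-identification accuracy: the fraction of ground-truth
# objects mentioned in a generated caption. Matching rule and labels are
# illustrative assumptions, not the paper's evaluation protocol.
def object_identification_accuracy(caption: str, gt_objects: list[str]) -> float:
    caption_lower = caption.lower()
    hits = sum(1 for obj in gt_objects if obj.lower() in caption_lower)
    return hits / len(gt_objects) if gt_objects else 0.0

# Hypothetical example: a real mug is recognised, a 3D-printed counterpart is not.
print(object_identification_accuracy("a mug and a banana on a table", ["mug", "banana"]))  # 1.0
print(object_identification_accuracy("a white plastic shape on a table", ["mug"]))         # 0.0
```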