🤖 AI Summary
This study systematically investigates the capability boundaries of vision-language models (VLMs) for augmented reality (AR) scene quality assessment, a task previously unexplored for lack of AR-specific evaluation benchmarks. To address this gap, we introduce DiverseAR, the first diverse, human-annotated AR dataset tailored for quality assessment. Using three commercial VLMs (GPT, Gemini, and Claude), we conduct a quantitative analysis across perception, description, and physical-consistency tasks. Results show that the VLMs achieve true positive rates of up to 93% for AR scene perception and 71% for description; however, their performance depends strongly on virtual content placement, rendering quality, and physical plausibility, and they are most likely to miss virtual content that is seamlessly integrated into the real scene (e.g., a virtual object casting physically accurate shadows). This work establishes a paradigm for automated AR experience assessment, accompanied by a new benchmark dataset and actionable insights into VLM limitations in embodied, spatially grounded contexts.
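As a rough illustration of the perception task and the TPR metric mentioned above (not the paper's exact protocol), the sketch below asks a VLM a yes/no question about virtual content for each annotated AR scene and computes the true positive rate. The `query_vlm` helper, the prompt wording, and the file names are placeholders; a real run would back them with an actual GPT, Gemini, or Claude API client and the DiverseAR annotations.

```python
# Minimal sketch of an AR-perception evaluation loop (illustrative only).
# `query_vlm` is a stub: swap in a real GPT, Gemini, or Claude call.

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a multimodal VLM query.

    Replace the body with an actual API call that sends the image and the
    prompt and returns the model's text response.
    """
    return "yes"  # mock response so the sketch runs end-to-end


def is_positive(response: str) -> bool:
    """Treat any response starting with 'yes' as 'virtual content detected'."""
    return response.strip().lower().startswith("yes")


def perception_tpr(dataset: list[dict]) -> float:
    """True positive rate over images annotated as AR scenes.

    Each dataset entry is expected to look like:
        {"path": "scene_001.jpg", "is_ar": True}
    """
    prompt = ("Does this image contain any virtual (computer-generated) "
              "objects? Answer yes or no.")
    positives = [item for item in dataset if item["is_ar"]]
    if not positives:
        return 0.0
    true_positives = sum(
        is_positive(query_vlm(item["path"], prompt)) for item in positives
    )
    return true_positives / len(positives)


if __name__ == "__main__":
    # Hypothetical annotations in the style of a DiverseAR-like dataset.
    sample = [
        {"path": "ar_glowing_apple.jpg", "is_ar": True},
        {"path": "ar_virtual_pot.jpg", "is_ar": True},
        {"path": "real_kitchen.jpg", "is_ar": False},
    ]
    print(f"Perception TPR: {perception_tpr(sample):.2%}")
```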
📝 Abstract
Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs -- GPT, Gemini, and Claude -- in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs' ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for description. While they excel at identifying obvious virtual objects, such as a glowing apple, they struggle when faced with seamlessly integrated content, such as a virtual pot with realistic shadows. Our results highlight both the strengths and the limitations of VLMs in understanding AR scenarios. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility. This study underscores the potential of VLMs as tools for evaluating the quality of AR experiences.
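For the description task, one simplified way to turn free-form VLM output into a score is to check whether the description mentions the annotated virtual objects. The keyword-overlap proxy below is an assumption for illustration only; the study's own scoring procedure, which relies on human annotation of the DiverseAR scenes, is not reproduced here.

```python
# Naive keyword-overlap proxy for the description task (illustrative only).
# Entry format and object labels are hypothetical, not taken from DiverseAR.

def description_hit(description: str, annotated_objects: list[str]) -> bool:
    """Return True if the description mentions every annotated virtual object."""
    text = description.lower()
    return all(obj.lower() in text for obj in annotated_objects)


def description_accuracy(results: list[dict]) -> float:
    """Fraction of AR scenes whose VLM description covers the annotated objects.

    Each entry is expected to look like:
        {"description": "...", "objects": ["virtual pot"]}
    """
    if not results:
        return 0.0
    hits = sum(description_hit(r["description"], r["objects"]) for r in results)
    return hits / len(results)


if __name__ == "__main__":
    sample = [
        {"description": "A glowing virtual apple floats above a real table.",
         "objects": ["apple"]},
        {"description": "A potted plant sits on the counter.",
         "objects": ["virtual pot"]},
    ]
    print(f"Descriptive accuracy (keyword proxy): {description_accuracy(sample):.2%}")
```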