🤖 AI Summary
This work addresses the systematic errors and unreliable confidence estimates of vision-language models (VLMs) on spatial reasoning tasks, which hinder their deployment in safety-critical applications. The authors propose a confidence estimation framework grounded in external geometric verification rather than conventional text-based self-evaluation. The approach fuses four signals (geometric alignment, spatial ambiguity, detection quality, and internal VLM uncertainty) with a gradient-boosting model to predict whether a spatial prediction is reliable. The method achieves AUROC scores of 0.674 on BLIP-2 and 0.583 on CLIP, improvements of 34.0% and 16.1% over text-based baselines. At 60% target accuracy on BLIP-2, it attains 61.9% coverage, more than twice the baseline's 27.6%. In scene graph construction, confidence-based pruning raises precision from 52.1% to 78.3% while retaining 68.2% of edges, and the results generalize across generative and classification architectures.
📝 Abstract
Vision-Language Models (VLMs) demonstrate impressive capabilities across multimodal tasks, yet exhibit systematic spatial reasoning failures, achieving only 49% (CLIP) to 54% (BLIP-2) accuracy on basic directional relationships. For safe deployment in robotics and autonomous systems, we need to predict when to trust VLM spatial predictions rather than accepting all outputs. We propose a vision-based confidence estimation framework that validates VLM predictions through independent geometric verification using object detection. Unlike text-based approaches relying on self-assessment, our method fuses four signals via gradient boosting: geometric alignment between VLM claims and coordinates, spatial ambiguity from overlap, detection quality, and VLM internal uncertainty. We achieve 0.674 AUROC on BLIP-2 (34.0% improvement over text-based baselines) and 0.583 AUROC on CLIP (16.1% improvement), generalizing across generative and classification architectures. Our framework enables selective prediction: at 60% target accuracy, we achieve 61.9% coverage versus 27.6% baseline (2.2x improvement) on BLIP-2. Feature analysis reveals vision-based signals contribute 87.4% of model importance versus 12.7% from VLM confidence, validating that external geometric verification outperforms self-assessment. We demonstrate reliable scene graph construction where confidence-based pruning improves precision from 52.1% to 78.3% while retaining 68.2% of edges.
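The selective-prediction figure above (coverage at a target accuracy) can be illustrated with a minimal sketch. It assumes one plausible reading of the metric: predictions are ranked by confidence, and coverage is the largest fraction of top-ranked predictions whose accuracy still meets the target. The function name `coverage_at_accuracy` and the toy data are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
from typing import Sequence


def coverage_at_accuracy(confidences: Sequence[float],
                         correct: Sequence[bool],
                         target_accuracy: float) -> float:
    """Largest fraction of predictions that can be kept, taking the most
    confident first, while accuracy on the kept set stays >= target.

    Assumed definition for illustration only.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    # Rank predictions from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    best_k = 0  # size of the largest prefix that meets the target
    hits = 0    # number of correct predictions in the current prefix
    for k, idx in enumerate(order, start=1):
        hits += int(correct[idx])
        if hits / k >= target_accuracy:
            best_k = k
    return best_k / n


# Toy data, made up for illustration (not results from the paper).
conf = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
ok = [True, True, False, True, True, False, False, False]
print(coverage_at_accuracy(conf, ok, 0.75))  # 0.625
```

On this toy data, the five most confident predictions contain four correct ones (80% accuracy), so 5 of 8 predictions can be kept at a 75% target. A better-calibrated confidence score pushes correct predictions toward the top of the ranking, which is what raises coverage at a fixed target.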