🤖 AI Summary
This work addresses spatial scale hallucination in vision-language models (VLMs) operating in GPS-denied environments, where missing camera metadata and telemetry leave no absolute scale reference and make spatial reasoning unreliable. To mitigate this, the authors propose VANGUARD, a lightweight, deterministic geometry-aware tool that uses small vehicles as environmental anchors. It detects vehicles with oriented bounding boxes, estimates their modal pixel length by kernel density estimation, and converts that length to ground sample distance (GSD) using a pre-calibrated reference length, giving large language model (LLM) agents a consistent metric scale. A composite confidence score returned alongside the GSD lets the calling agent decide whether to trust the measurement or fall back to another strategy. On DOTA v1.5, the method achieves a median GSD error of 6.87% on 306 images; combined with SAM-based segmentation for area estimation, it yields a median error of 19.7% on a 100-entry benchmark, with 2.6× lower category dependence and 4× fewer catastrophic failures than the best VLM baseline.
📝 Abstract
Autonomous aerial robots operating in GPS-denied or communication-degraded environments frequently lose access to camera metadata and telemetry, leaving onboard perception systems unable to recover the absolute metric scale of the scene. As LLM/VLM-based planners are increasingly adopted as high-level agents for embodied systems, their ability to reason about physical dimensions becomes safety-critical, yet our experiments show that five state-of-the-art VLMs suffer from spatial scale hallucinations, with median area estimation errors exceeding 50%. We propose VANGUARD, a lightweight, deterministic Geometric Perception Skill designed as a callable tool that any LLM-based agent can invoke to recover Ground Sample Distance (GSD) from ubiquitous environmental anchors: small vehicles detected via oriented bounding boxes, whose modal pixel length is robustly estimated through kernel density estimation and converted to GSD using a pre-calibrated reference length. The tool returns both a GSD estimate and a composite confidence score, enabling the calling agent to autonomously decide whether to trust the measurement or fall back to alternative strategies. On the DOTA v1.5 benchmark, VANGUARD achieves 6.87% median GSD error on 306 images. Integrated with SAM-based segmentation for downstream area measurement, the pipeline yields 19.7% median error on a 100-entry benchmark, with 2.6x lower category dependence and 4x fewer catastrophic failures than the best VLM baseline, demonstrating that equipping agents with deterministic geometric tools is essential for safe autonomous spatial reasoning.
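The core GSD computation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the reference length of 4.5 m, the normal-reference bandwidth rule, the grid search for the mode, and the function names `modal_length_px`, `estimate_gsd`, and `area_m2` are all assumptions made for illustration. The composite confidence score is omitted because its components are not specified here. The input is assumed to be the long-side lengths, in pixels, of oriented bounding boxes from some vehicle detector.

```python
import math
import statistics

# Assumed reference length of a small vehicle in meters. The paper's
# pre-calibrated value is not stated in this section.
ASSUMED_VEHICLE_LENGTH_M = 4.5


def modal_length_px(lengths_px, grid_points=512):
    """Return the mode of a Gaussian KDE fitted to vehicle lengths in pixels."""
    n = len(lengths_px)
    if n < 2:
        raise ValueError("need at least two detections")
    sigma = statistics.stdev(lengths_px)
    # Normal-reference rule-of-thumb bandwidth (a choice made for this sketch).
    bandwidth = 1.06 * sigma * n ** (-1 / 5)
    if bandwidth <= 0:
        return float(lengths_px[0])  # all lengths are identical
    lo, hi = min(lengths_px), max(lengths_px)
    best_x, best_density = lo, -1.0
    # Evaluate the unnormalized density on a uniform grid and keep the peak.
    for i in range(grid_points):
        x = lo + (hi - lo) * i / (grid_points - 1)
        density = sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in lengths_px
        )
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def estimate_gsd(lengths_px, reference_length_m=ASSUMED_VEHICLE_LENGTH_M):
    """Ground sample distance in meters per pixel."""
    return reference_length_m / modal_length_px(lengths_px)


def area_m2(pixel_count, gsd):
    """Convert a segmentation mask's pixel count to square meters."""
    return pixel_count * gsd ** 2


# Six car-sized detections plus one longer outlier (for example, a truck).
lengths = [30.0, 29.5, 30.5, 30.2, 29.8, 30.1, 55.0]
gsd = estimate_gsd(lengths)
mask_area = area_m2(10_000, gsd)
```

Using the KDE mode rather than the mean keeps the single long detection from pulling the estimate upward, which is the robustness property the abstract attributes to modal length estimation. Because area scales with the square of GSD, a relative GSD error roughly doubles in the area estimate, so downstream area errors are expected to exceed the GSD error.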