Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language

📅 2026-06-07
🤖 AI Summary
This work addresses the challenge of translating spatial information from natural language, particularly references to parts of the world beyond a robot’s perceptual field of view, into calibrated three-dimensional spatial belief distributions that can be fused with other sensor modalities. The authors propose the Language Sensor Model (LSM), which treats natural language as a calibratable “sensor”: it maps each utterance and its scene-graph context to a multimodal distribution whose mixture weights encode referential ambiguity and whose component covariances encode spatial uncertainty. Within the VL-Map framework, these language predictions are treated as stochastic observations and fused with onboard perception in a unified belief map. On the VLA-3D benchmark and on a real-world mobile robot, LSM is the only language predictor whose covariance estimates remain within the calibrated regime; fused into VL-Map, it places approximately 70% more probability mass on the true target than the strongest foundation-model baseline.
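To make "covariance calibration" concrete, the sketch below uses a standard consistency check from state estimation: the normalized estimation error squared (NEES), which averages to the state dimension when the predicted covariances match the actual errors. This is an illustrative assumption, not necessarily the calibration metric the paper reports; it also assumes a single Gaussian per prediction rather than a mixture, and the function name `mean_nees` and the synthetic errors are hypothetical.

```python
import numpy as np


def mean_nees(errors, covs):
    """Average squared Mahalanobis distance of errors under predicted covariances.

    errors: (n, d) array of (true position - predicted mean).
    covs:   (n, d, d) array of predicted covariances.
    For a calibrated Gaussian predictor the result is close to d.
    """
    inv = np.linalg.inv(covs)  # batched inverse over the leading axis
    nees = np.einsum("ni,nij,nj->n", errors, inv, errors)
    return nees.mean()


rng = np.random.default_rng(0)
n, d = 5000, 3

# Synthetic localization errors drawn from a known covariance (illustration only).
true_cov = np.diag([0.2, 0.2, 0.05])
errors = rng.multivariate_normal(np.zeros(d), true_cov, size=n)

# A predictor reporting the true covariance is calibrated: mean NEES is near d = 3.
calibrated = mean_nees(errors, np.broadcast_to(true_cov, (n, d, d)))

# A predictor reporting a covariance 4x too small is overconfident: mean NEES is near 12.
overconfident = mean_nees(errors, np.broadcast_to(true_cov / 4.0, (n, d, d)))

print(f"calibrated: {calibrated:.2f}, overconfident: {overconfident:.2f}")
```

A mean NEES well above the dimension indicates overconfident covariances, and a value well below it indicates underconfident ones.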
📝 Abstract
Robots deployed in human-centric environments routinely receive natural-language descriptions of spatial information ("I left my backpack on the table") that reference parts of the world beyond their perceptual field of view. Traditional metric-semantic mapping ignores this signal, while off-the-shelf multimodal models remain limited in 3D spatial reasoning and are not directly amenable to fusion with other sensor modalities. To convert language observations into a calibrated spatial distribution, we train a Language Sensor Model (LSM) that maps each utterance and its scene-graph context to a multimodal distribution, with mixture weights encoding referential ambiguity (e.g., "which table") and component covariances encoding spatial uncertainty (e.g., where "on the table" the target lies). We then introduce VL-Map (Vision-Language Metric-Semantic Mapping), a probabilistic framework that treats these language predictions as stochastic observations and fuses them with onboard perception within a unified belief map. On the VLA-3D benchmark as well as on a real-world mobile robot, LSM is the only language predictor whose covariance estimates remain within the calibrated regime; fused into VL-Map, it leads to more accurate predictions of the target object location (~70% more probability mass on the true target compared to the strongest foundation-model baseline).
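The idea of a language observation as a mixture distribution fused with perception can be sketched as follows. This is a minimal 2D grid illustration under stated assumptions, not the authors' implementation: the mixture parameters are hand-written rather than predicted by a trained LSM, the perception prior is a simple "observed empty" mask, fusion is a cell-wise Bayes update, and all names (`language_likelihood`, `fuse`, the table positions) are hypothetical.

```python
import numpy as np


def gaussian_pdf(points, mean, cov):
    """Evaluate a 2D Gaussian density at each row of `points`."""
    diff = points - mean
    inv = np.linalg.inv(cov)
    maha = np.einsum("ni,ij,nj->n", diff, inv, diff)
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm


def language_likelihood(points, weights, means, covs):
    """Gaussian mixture: weights model 'which table', covariances model 'where on it'."""
    return sum(w * gaussian_pdf(points, m, c) for w, m, c in zip(weights, means, covs))


def fuse(prior, likelihood):
    """Cell-wise Bayes update followed by normalization over the grid."""
    posterior = prior * likelihood
    return posterior / posterior.sum()


# 10 m x 10 m grid at 0.1 m resolution (assumed for illustration).
xs = np.linspace(0.0, 10.0, 101)
gx, gy = np.meshgrid(xs, xs, indexing="ij")
points = np.stack([gx.ravel(), gy.ravel()], axis=1)

# Hypothetical output for "I left my backpack on the table" with two candidate tables.
weights = np.array([0.6, 0.4])
means = np.array([[2.0, 3.0], [7.0, 6.0]])
covs = np.array([np.diag([0.3, 0.3]), np.diag([0.5, 0.2])])

# Perception prior: the robot has already seen the area around the first table
# and found no backpack, so those cells are strongly down-weighted.
prior = np.ones(len(points))
observed_empty = np.linalg.norm(points - means[0], axis=1) < 1.5
prior[observed_empty] = 1e-3
prior /= prior.sum()

posterior = fuse(prior, language_likelihood(points, weights, means, covs))
best = points[np.argmax(posterior)]
print("most likely target cell:", best)
```

With language alone the first table would be the most likely location; after fusing the perception prior, the belief shifts to the second table, which is the behavior a unified belief map is meant to capture.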
Problem

Research questions and friction points this paper is trying to address.

spatial belief estimation
natural language
3D scenes
language grounding
sensor fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language Sensor Model
VL-Map
calibrated spatial belief
multimodal distribution
vision-language mapping