VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the adverse impact of semantic errors, particularly for rare object categories, on free-space estimation, collision checking, and temporal consistency in 3D semantic occupancy models. To mitigate this, the authors propose VISA, a training-time semantic auditing framework that uses a vision-language model (VLM) not for embedding alignment but as a reliability-aware auditor. VISA generates a structured semantic audit for each object instance, covering class hypotheses, plausible confusions, reliability, attributes, and supporting evidence, and propagates it along the object track to the matched 3D voxels. A reliability-weighted distillation loss then transfers this knowledge to the semantic logits, so no VLM inference is needed at test time. On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU rises from 18.18 to 19.16 and rare-class mIoU from 15.60 to 16.79.
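To make the audit-and-propagate step concrete, the following is a minimal sketch of what a per-instance audit record and its propagation along a track might look like. The names `InstanceAudit` and `propagate_audit`, the field layout, and the track representation are all assumptions made for illustration; the summary does not specify the authors' data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


@dataclass
class InstanceAudit:
    """Hypothetical container for one VLM audit of a physical object instance."""
    class_hypotheses: Dict[str, float]   # class name -> score assigned by the VLM
    confusions: List[str]                # classes the VLM considers plausible confusions
    reliability: float                   # assumed to lie in [0, 1]
    attributes: List[str] = field(default_factory=list)
    evidence: str = ""                   # free-text justification from the VLM


def propagate_audit(
    audit: InstanceAudit,
    track_voxels: Dict[int, List[Voxel]],
) -> Dict[Tuple[int, Voxel], InstanceAudit]:
    """Attach one audit to every voxel the tracked instance occupies.

    track_voxels maps a frame index to the voxel indices matched to the
    instance in that frame (an assumed representation of the object track).
    """
    return {
        (frame_id, voxel): audit
        for frame_id, voxels in track_voxels.items()
        for voxel in voxels
    }


# Illustrative values only; they are not taken from the paper.
example_audit = InstanceAudit(
    class_hypotheses={"motorcycle": 0.7, "bicycle": 0.3},
    confusions=["bicycle"],
    reliability=0.8,
    attributes=["two wheels"],
    evidence="engine block visible in crop",
)
example_track = {0: [(1, 2, 3), (1, 2, 4)], 1: [(2, 2, 3)]}
grounded = propagate_audit(example_audit, example_track)
```

In this sketch a single representative crop is audited once and the result is shared by every frame of the track, which mirrors the description above of querying the VLM offline rather than per frame.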

📝 Abstract

Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but object and rare-class errors can affect free-space interpretation, collision checking, and temporal state propagation. We show that a common VLM strategy, aligning 3D voxel or object features with crop-caption embeddings, improves text-space similarity without reliably improving closed-set occupancy mIoU. Motivated by this mismatch, we propose VISA, a training-time semantic auditing approach for existing occupancy world models. VISA queries an offline VLM on a representative crop of each physical object instance, obtains a structured audit with class hypotheses, plausible confusions, reliability, attributes, and evidence, and propagates it along the object track. The audit is grounded to matched 3D object voxels and distilled into semantic logits through reliability-weighted taxonomy, attribute-factor, and scene-level audit graph losses, while inference remains unchanged and requires no VLM. On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU improves from 18.18 to 19.16 and rare-class mIoU from 15.60 to 16.79. These results suggest that VLMs are better suited to closed-set occupancy as reliability-aware semantic auditors than as generic caption-embedding targets.
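As an illustration of the reliability-weighted distillation idea, the sketch below weights a per-voxel cross-entropy between audit-derived soft targets and predicted class probabilities by the audit's reliability. This is a simplified stand-in under stated assumptions, not the authors' loss: the function name, the cross-entropy form, and the normalisation by total reliability are hypothetical, and the taxonomy, attribute-factor, and scene-level audit graph terms mentioned in the abstract are not modelled.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one voxel's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reliability_weighted_distillation(
    voxel_logits: Sequence[Sequence[float]],
    audit_targets: Sequence[Sequence[float]],
    reliabilities: Sequence[float],
) -> float:
    """Reliability-weighted mean cross-entropy (hypothetical form).

    voxel_logits:  predicted class logits, one row per audited voxel
    audit_targets: soft class distributions derived from the VLM audit
    reliabilities: per-voxel audit reliability, assumed to lie in [0, 1]
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, target, reliability in zip(voxel_logits, audit_targets, reliabilities):
        probs = softmax(logits)
        cross_entropy = -sum(t * math.log(p + 1e-12) for t, p in zip(target, probs))
        weighted_sum += reliability * cross_entropy
        weight_total += reliability
    # Guard against an all-zero reliability batch.
    return weighted_sum / max(weight_total, 1e-12)


# Two voxels: the second has zero reliability, so it does not affect the loss.
loss = reliability_weighted_distillation(
    voxel_logits=[[0.0, 0.0], [5.0, -5.0]],
    audit_targets=[[1.0, 0.0], [0.0, 1.0]],
    reliabilities=[1.0, 0.0],
)
```

Because the loss only touches training-time logits, a term of this kind leaves the inference path unchanged, which is consistent with the abstract's statement that no VLM is required at test time.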

Problem

Research questions and friction points this paper is trying to address.

3D occupancy

semantic errors

rare-class errors

vision-language models

autonomous driving

Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-guided auditing

3D occupancy

semantic distillation