🤖 AI Summary
This work addresses a gap in persistent mapping: semantic outputs from foundation models are fused with geometric perception without calibrating their reliability or detecting conflicts between the two. The paper proposes an update operator that combines a per-class calibrated commit gate with a per-event conflict-drop window, so that semantic claims are committed to the map only when their class is reliable enough and the geometric channel does not contradict them at the moment of the claim. The evaluation covers KITTI-360 and ScanNet, using panoptic ground truth as an oracle geometric channel and Mask2Former as an off-the-shelf semantic segmenter. On KITTI-360, the operator reaches 99.7% commit precision for the car class (vs. 43.9% for a calibration-only operator) and raises mean per-class IoU from 0.180 to 0.522; it also retains more compositional true positives, at higher precision, than a monolithic compositional vision-language model (VLM) prompt.
📝 Abstract
Persistent maps used by autonomous robots increasingly fuse two channels: a geometric perception stack whose assertions are well characterized, and a foundation-model channel that produces semantic claims about the same scene without calibrated reliability. Contemporary mapping systems integrate the two by treating the foundation-model channel as an additional voter in a per-element posterior. That voter is not calibrated for its own per-class reliability, and there is no machinery to flag when the two channels contradict each other at a given moment. We propose an update operator with two cooperating mechanisms: a per-class calibrated commit gate, and a per-event conflict-drop window that refuses to commit foundation-model claims contradicted by the geometric channel at the moment of the claim. We evaluate on KITTI-360 and ScanNet with an oracle geometric channel (panoptic ground truth) and with an off-the-shelf online semantic segmenter (Mask2Former), the latter to demonstrate real-world performance. The operator produces substantially more accurate committed maps (on KITTI-360, car commit precision is 99.7% vs. 43.9% for the calibration-only operator, and mean per-class IoU is 0.522 vs. 0.180), and it retains more compositional true positives at higher precision than a monolithic compositional VLM prompt. The framework operates at deployment quality across both oracle and off-the-shelf-segmenter geometric channels, and is invariant under foundation-model substitution.
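The two mechanisms of the update operator can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the class names (`Claim`, `GeometricObservation`, `ConflictAwareMap`), the per-class precision table, the threshold `commit_threshold`, and the window width `conflict_window_s` are all hypothetical choices made for illustration. A claim is committed only if its class passes the calibrated gate and no geometric observation of the same map element disagrees with it within the time window around the claim.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Claim:
    """A semantic claim from the foundation-model channel (hypothetical type)."""
    element_id: int    # map element (e.g. a segment) the claim refers to
    label: str         # semantic class asserted by the foundation model
    timestamp: float   # time of the claim, in seconds


@dataclass
class GeometricObservation:
    """An assertion from the geometric channel (hypothetical type)."""
    element_id: int
    label: str
    timestamp: float


@dataclass
class ConflictAwareMap:
    # Assumed calibration table: estimated precision of the foundation
    # model for each class, e.g. measured on a held-out set.
    class_precision: Dict[str, float]
    commit_threshold: float = 0.9    # assumed gate threshold
    conflict_window_s: float = 0.5   # assumed half-width of the conflict window
    committed: Dict[int, str] = field(default_factory=dict)

    def update(self, claim: Claim,
               observations: Iterable[GeometricObservation]) -> str:
        # Mechanism 1: per-class calibrated commit gate.
        # Unknown classes are treated as having zero precision.
        if self.class_precision.get(claim.label, 0.0) < self.commit_threshold:
            return "gated"

        # Mechanism 2: per-event conflict-drop window. Drop the claim if the
        # geometric channel asserts a different label for the same element
        # close enough in time to the claim.
        for obs in observations:
            same_element = obs.element_id == claim.element_id
            in_window = abs(obs.timestamp - claim.timestamp) <= self.conflict_window_s
            if same_element and in_window and obs.label != claim.label:
                return "dropped"

        self.committed[claim.element_id] = claim.label
        return "committed"
```

In this sketch, a `"car"` claim with a table precision of 0.95 is committed unless a contradicting geometric observation of the same element falls inside the window, while a claim for a class with low table precision is gated regardless of what the geometric channel reports. Returning the decision as a string keeps the two rejection reasons distinguishable, which is useful when measuring how much each mechanism contributes.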