KM-ViPE: Online Tightly Coupled Vision-Language-Geometry Fusion for Open-Vocabulary Semantic SLAM

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces the first real-time open-vocabulary semantic SLAM framework for uncalibrated monocular cameras, addressing online localization and semantic mapping in dynamic environments without depth sensors or offline calibration. Methodologically, it establishes a tightly coupled vision–language–geometry architecture that leverages DINO features as its visual backbone, performs open-vocabulary semantic association via cross-modal alignment, applies an adaptive robust kernel driven by high-level features, and incorporates geometric constraints in a real-time joint optimization. Notably, it robustly models both moving objects and movable static objects, without requiring RGB-D, IMU, or prior maps. Evaluated on multiple dynamic-scene benchmarks, it achieves state-of-the-art performance and, for the first time, enables online open-vocabulary semantic SLAM from pure RGB input.
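The paper does not spell out its adaptive robust kernel, but the idea of downweighting residuals on likely-dynamic objects using high-level features can be sketched as follows. This is a minimal illustration built on a standard Huber kernel; the `dynamic_score` input and the linear threshold schedule are assumptions for the sketch, not the paper's actual formulation.

```python
def huber_weight(residual: float, delta: float) -> float:
    """Standard IRLS weight for the Huber robust kernel with threshold delta."""
    r = abs(residual)
    return 1.0 if r <= delta else delta / r

def adaptive_robust_weight(residual: float,
                           dynamic_score: float,
                           base_delta: float = 1.0,
                           min_delta: float = 0.1) -> float:
    """Hypothetical adaptive kernel: shrink the Huber threshold for points
    whose high-level features suggest a dynamic or movable object
    (dynamic_score in [0, 1]), so their residuals are downweighted harder
    in the joint optimization."""
    delta = base_delta - (base_delta - min_delta) * dynamic_score
    return huber_weight(residual, delta)
```

With this schedule, a point flagged as dynamic (`dynamic_score` near 1) contributes far less to the pose estimate than a static point with the same reprojection residual.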

📝 Abstract
We present KM-ViPE (Knowledge Mapping Video Pose Engine), a real-time open-vocabulary SLAM framework for uncalibrated monocular cameras in dynamic environments. Unlike systems that require depth sensors and offline calibration, KM-ViPE operates directly on raw RGB streams, making it ideal for ego-centric applications and for harvesting internet-scale video data for training. KM-ViPE tightly couples DINO visual features with geometric constraints through an adaptive robust kernel based on high-level features that handles both moving objects and movable static objects (e.g., moving furniture in ego-centric views). The system performs simultaneous online localization and open-vocabulary semantic mapping by fusing geometric and deep visual features aligned with language embeddings. Our results are competitive with state-of-the-art approaches, whereas existing solutions either operate offline, need depth data and/or odometry estimation, or lack dynamic-scene robustness. KM-ViPE benefits from internet-scale training and uniquely combines online operation, uncalibrated monocular input, and robust handling of dynamic scenes, making it well suited to autonomous robotics and AR/VR applications and advancing practical spatial intelligence for embodied AI.
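The open-vocabulary mapping step described above, assigning semantic labels by aligning visual features with language embeddings, commonly reduces to a cosine-similarity lookup. The sketch below illustrates that pattern; the function name, the threshold value, and the "unknown" fallback are assumptions for illustration, not details from the paper.

```python
import numpy as np

def assign_open_vocab_labels(point_feats: np.ndarray,
                             text_embeds: np.ndarray,
                             labels: list,
                             threshold: float = 0.2) -> list:
    """Assign each map point the label whose text embedding is most
    cosine-similar to the point's visual feature; fall back to 'unknown'
    when no similarity clears the threshold."""
    # L2-normalize rows so plain dot products are cosine similarities.
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = p @ t.T                      # shape: (num_points, num_labels)
    best = sims.argmax(axis=1)
    return [labels[j] if sims[i, j] >= threshold else "unknown"
            for i, j in enumerate(best)]
```

Because the label set is just a list of text embeddings, new categories can be queried at runtime without retraining, which is what makes the vocabulary "open".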
Problem

Research questions and friction points this paper is trying to address.

Online open-vocabulary semantic SLAM for uncalibrated monocular cameras
Robustly handling dynamic scenes and movable objects
Fusing visual, geometric, and language features for real-time mapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tightly couples DINO visual features with geometric constraints
Fuses geometric and deep visual features with language embeddings
Operates online with uncalibrated monocular cameras in dynamic environments
Zaid Nasser
Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Saint Petersburg, Russia
Mikhail Iumanov
Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Saint Petersburg, Russia
Tianhao Li
Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Saint Petersburg, Russia
Maxim Popov
Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Saint Petersburg, Russia
Jaafar Mahmoud
Ph.D. student, ITMO University (Mobile Robotics, Machine Learning, SLAM)
Malik Mohrat
Ph.D. student, ITMO University (Computer Vision, Mobile Robotics, Mapping, ML)
Ilya Obrubov
SBER Robotics Center, Moscow, Russia
Ekaterina Derevyanka
SBER Robotics Center, Moscow, Russia
Ivan Sosin
SBER Robotics Center, Moscow, Russia
Sergey Kolyubin
Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Saint Petersburg, Russia