KM-ViPE: Online Tightly Coupled Vision-Language-Geometry Fusion for Open-Vocabulary Semantic SLAM

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces the first real-time open-vocabulary semantic SLAM framework for uncalibrated monocular cameras, addressing online localization and semantic mapping in dynamic environments without depth sensors or offline calibration. Methodologically, it establishes a tightly coupled vision–language–geometry architecture that leverages DINO features as its visual backbone, performs open-vocabulary semantic association via cross-modal alignment, applies an adaptive robust kernel driven by high-level features, and incorporates geometric constraints in a real-time joint optimization. Notably, it robustly models both moving objects and movable static objects, without requiring RGB-D, IMU, or prior maps. Evaluated on multiple dynamic-scene benchmarks, it achieves state-of-the-art performance and, for the first time, enables online open-vocabulary semantic SLAM from pure RGB input.
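The paper does not spell out its adaptive robust kernel, but the idea of downweighting residuals on likely-dynamic objects using high-level features can be sketched as follows. This is a minimal illustration built on a standard Huber kernel; the `dynamic_score` input and the linear threshold schedule are assumptions for the sketch, not the paper's actual formulation.

```python
def huber_weight(residual: float, delta: float) -> float:
    """Standard IRLS weight for the Huber robust kernel with threshold delta."""
    r = abs(residual)
    return 1.0 if r <= delta else delta / r

def adaptive_robust_weight(residual: float,
                           dynamic_score: float,
                           base_delta: float = 1.0,
                           min_delta: float = 0.1) -> float:
    """Hypothetical adaptive kernel: shrink the Huber threshold for points
    whose high-level features suggest a dynamic or movable object
    (dynamic_score in [0, 1]), so their residuals are downweighted harder
    in the joint optimization."""
    delta = base_delta - (base_delta - min_delta) * dynamic_score
    return huber_weight(residual, delta)
```

With this schedule, a point flagged as dynamic (`dynamic_score` near 1) contributes far less to the pose estimate than a static point with the same reprojection residual.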

📝 Abstract
We present KM-ViPE (Knowledge Mapping Video Pose Engine), a real-time open-vocabulary SLAM framework for uncalibrated monocular cameras in dynamic environments. Unlike systems that require depth sensors and offline calibration, KM-ViPE operates directly on raw RGB streams, making it ideal for ego-centric applications and for harvesting internet-scale video data for training. KM-ViPE tightly couples DINO visual features with geometric constraints through an adaptive robust kernel based on high-level features that handles both moving objects and movable static objects (e.g., moving furniture in ego-centric views). The system performs simultaneous online localization and open-vocabulary semantic mapping by fusing geometric and deep visual features aligned with language embeddings. Our results are competitive with state-of-the-art approaches, whereas existing solutions either operate offline, need depth data and/or odometry estimation, or lack dynamic-scene robustness. KM-ViPE benefits from internet-scale training and uniquely combines online operation, uncalibrated monocular input, and robust handling of dynamic scenes, making it well suited to autonomous robotics and AR/VR applications and advancing practical spatial intelligence for embodied AI.
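The open-vocabulary mapping step described above, assigning semantic labels by aligning visual features with language embeddings, commonly reduces to a cosine-similarity lookup. The sketch below illustrates that pattern; the function name, the threshold value, and the "unknown" fallback are assumptions for illustration, not details from the paper.

```python
import numpy as np

def assign_open_vocab_labels(point_feats: np.ndarray,
                             text_embeds: np.ndarray,
                             labels: list,
                             threshold: float = 0.2) -> list:
    """Assign each map point the label whose text embedding is most
    cosine-similar to the point's visual feature; fall back to 'unknown'
    when no similarity clears the threshold."""
    # L2-normalize rows so plain dot products are cosine similarities.
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = p @ t.T                      # shape: (num_points, num_labels)
    best = sims.argmax(axis=1)
    return [labels[j] if sims[i, j] >= threshold else "unknown"
            for i, j in enumerate(best)]
```

Because the label set is just a list of text embeddings, new categories can be queried at runtime without retraining, which is what makes the vocabulary "open".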
Problem

Research questions and friction points this paper is trying to address.

Online open-vocabulary semantic SLAM for uncalibrated monocular cameras
Robustly handling dynamic scenes and movable objects
Fusing visual, geometric, and language features for real-time mapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tightly couples DINO visual features with geometric constraints
Fuses geometric and deep visual features with language embeddings
Operates online with uncalibrated monocular cameras in dynamic environments
Zaid Nasser
Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Saint Petersburg, Russia
Mikhail Iumanov
Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Saint Petersburg, Russia
Tianhao Li
Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Saint Petersburg, Russia
Maxim Popov
Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Saint Petersburg, Russia
Jaafar Mahmoud
Ph.D. student, ITMO University (Mobile Robotics, Machine Learning, SLAM)
Malik Mohrat
Ph.D. student, ITMO University (Computer Vision, Mobile Robotics, Mapping, ML)
Ilya Obrubov
SBER Robotics Center, Moscow, Russia
Ekaterina Derevyanka
SBER Robotics Center, Moscow, Russia
Ivan Sosin
SBER Robotics Center, Moscow, Russia
Sergey Kolyubin
Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University, Saint Petersburg, Russia