🤖 AI Summary
This paper introduces the first real-time open-vocabulary semantic SLAM framework for uncalibrated monocular cameras, addressing online localization and semantic mapping in dynamic environments without depth sensors or offline calibration. Methodologically, it establishes a tightly coupled vision–language–geometry architecture: leveraging DINO features as visual foundations, enabling open-vocabulary semantic association via cross-modal alignment, designing an adaptive robust kernel function based on high-level features, and incorporating geometric constraints for real-time joint optimization. Notably, it robustly models both moving objects and movable static objects—without requiring RGB-D, IMU, or prior maps. Evaluated on multiple dynamic-scene benchmarks, it achieves state-of-the-art performance and, for the first time, enables online open-vocabulary semantic SLAM from pure RGB input.
📝 Abstract
We present KM-ViPE (Knowledge Mapping Video Pose Engine), a real-time open-vocabulary SLAM framework for uncalibrated monocular cameras in dynamic environments. Unlike systems requiring depth sensors and offline calibration, KM-ViPE operates directly on raw RGB streams, making it ideal for ego-centric applications and harvesting internet-scale video data for training. KM-ViPE tightly couples DINO visual features with geometric constraints through a high-level features based adaptive robust kernel that handles both moving objects and movable static objects (e.g., moving furniture in ego-centric views). The system performs simultaneous online localization and open-vocabulary semantic mapping by fusing geometric and deep visual features aligned with language embeddings. Our results are competitive with state-of-the-art approaches, while existing solutions either operate offline, need depth data and/or odometry estimation, or lack dynamic scene robustness. KM-ViPE benefits from internet-scale training and uniquely combines online operation, uncalibrated monocular input, and robust handling of dynamic scenes, which makes it a good fit for autonomous robotics and AR/VR applications and advances practical spatial intelligence capabilities for embodied AI.