🤖 AI Summary
Existing deep learning approaches to camera relocalization are typically trained on a single scene and generalize poorly to other environments. To address this, the authors propose MVL-Loc, an end-to-end multi-scene 6-DoF pose estimation framework that integrates vision-language models (VLMs) into relocalization. Drawing on the semantic priors and cross-modal alignment of VLMs, MVL-Loc uses natural language instructions to guide feature learning, which improves semantic understanding and the modeling of spatial relationships in complex indoor and outdoor scenes. The method fuses visual, linguistic, and geometric cues, aligns the resulting multimodal features, and regresses the camera pose end to end. On the 7Scenes and Cambridge Landmarks datasets, MVL-Loc is reported to reach state-of-the-art accuracy in both position and orientation, which the authors present as evidence of generalization across heterogeneous scenes and robustness in real-world settings.
📝 Abstract
Camera relocalization, a cornerstone capability of modern computer vision, accurately determines a camera's position and orientation (6-DoF) from images and is essential for applications in augmented reality (AR), mixed reality (MR), autonomous driving, delivery drones, and robotic navigation. Unlike traditional deep learning-based methods that regress camera pose from images in a single scene, which often lack generalization and robustness in diverse environments, we propose MVL-Loc, a novel end-to-end multi-scene 6-DoF camera relocalization framework. MVL-Loc leverages pretrained world knowledge from vision-language models (VLMs) and incorporates multimodal data to generalize across both indoor and outdoor settings. Furthermore, natural language is employed as a directive tool to guide the multi-scene learning process, facilitating semantic understanding of complex scenes and capturing spatial relationships among objects. Extensive experiments on the 7Scenes and Cambridge Landmarks datasets demonstrate MVL-Loc's robustness and state-of-the-art performance in real-world multi-scene camera relocalization, with improved accuracy in both positional and orientational estimates.
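To make "accuracy in both positional and orientational estimates" concrete, the sketch below shows one common way to score a 6-DoF pose prediction: Euclidean distance for position and the angle between unit quaternions for orientation. It is an illustration only, not MVL-Loc's evaluation code. The function names, the `(w, x, y, z)` quaternion ordering, and the sample values are assumptions introduced here.

```python
import math

def position_error(t_pred, t_true):
    """Euclidean distance between predicted and true camera positions.

    The result is in the same units as the inputs (e.g. meters).
    """
    return math.dist(t_pred, t_true)

def orientation_error_deg(q_pred, q_true):
    """Angle in degrees between two orientations given as quaternions.

    Quaternions are assumed to be ordered (w, x, y, z). They are normalized
    here, and the absolute value of the dot product is used because q and -q
    represent the same rotation.
    """
    norm_pred = math.sqrt(sum(c * c for c in q_pred))
    norm_true = math.sqrt(sum(c * c for c in q_true))
    dot = abs(sum(a * b for a, b in zip(q_pred, q_true))) / (norm_pred * norm_true)
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))

# Hypothetical example: the prediction is offset by 3 units in x and 4 in y,
# and rotated 90 degrees about the z axis relative to the ground truth.
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))

pos_err = position_error((3.0, 4.0, 0.0), (0.0, 0.0, 0.0))
rot_err = orientation_error_deg(quarter_turn_z, identity)
```

Keeping the two errors separate, rather than folding them into a single score, matches how the abstract reports results: position and orientation are stated as distinct quantities.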