MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization

📅 2025-07-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing single-scene deep learning approaches to camera relocalization generalize poorly and lack robustness across diverse environments. To address this, we propose MVL-Loc, an end-to-end multi-scene 6-DoF pose estimation framework that integrates vision-language models (VLMs) into relocalization. Leveraging VLMs' semantic priors and cross-modal alignment capabilities, MVL-Loc uses natural language instructions to guide feature learning, which improves semantic understanding and spatial relationship modeling in complex indoor and outdoor scenes. The method jointly fuses visual, linguistic, and geometric cues, enabling adaptive multimodal feature alignment and end-to-end pose regression. Evaluated on 7Scenes and Cambridge Landmarks, MVL-Loc achieves state-of-the-art position and orientation accuracy, indicating strong generalization across heterogeneous scenes and robustness for real-world deployment.
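To make "end-to-end pose regression" concrete, here is a minimal sketch of a PoseNet-style regression loss: a position term plus a weighted unit-quaternion orientation term. This is a common formulation in the pose regression literature and is an assumption here, not the loss confirmed for MVL-Loc; the function name and the `beta` weight are hypothetical.

```python
import math

def pose_regression_loss(t_pred, q_pred, t_true, q_true, beta=1.0):
    """Illustrative PoseNet-style loss (assumed, not taken from MVL-Loc).

    t_*: 3-element positions; q_*: 4-element quaternions (w, x, y, z).
    q_true is expected to be unit length. beta is a hypothetical weight
    balancing the orientation term against the position term.
    """
    # Normalize the predicted quaternion so it represents a valid rotation.
    norm = math.sqrt(sum(c * c for c in q_pred))
    q_hat = [c / norm for c in q_pred]

    # Position term: Euclidean distance between predicted and true position.
    pos_err = math.dist(t_pred, t_true)

    # Orientation term: q and -q encode the same rotation, so take the
    # smaller of the two Euclidean distances.
    d_plus = math.dist(q_hat, q_true)
    d_minus = math.dist(q_hat, [-c for c in q_true])
    rot_err = min(d_plus, d_minus)

    return pos_err + beta * rot_err
```

In practice the weight between the two terms is usually tuned per dataset or learned, since position and orientation errors live on different scales.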

📝 Abstract

Camera relocalization, a cornerstone capability of modern computer vision, determines a camera's position and orientation (6-DoF) from images and is essential for applications in augmented reality (AR), mixed reality (MR), autonomous driving, delivery drones, and robotic navigation. Traditional deep learning-based methods regress camera pose from images in a single scene and often lack generalization and robustness in diverse environments. In contrast, we propose MVL-Loc, a novel end-to-end multi-scene 6-DoF camera relocalization framework. MVL-Loc leverages pretrained world knowledge from vision-language models (VLMs) and incorporates multimodal data to generalize across both indoor and outdoor settings. Furthermore, natural language is employed as a directive tool to guide the multi-scene learning process, facilitating semantic understanding of complex scenes and capturing spatial relationships among objects. Extensive experiments on the 7Scenes and Cambridge Landmarks datasets demonstrate MVL-Loc's robustness and state-of-the-art performance in real-world multi-scene camera relocalization, with improved accuracy in both positional and orientational estimates.
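Positional and orientational accuracy on benchmarks such as 7Scenes and Cambridge Landmarks is typically reported as translation error in meters and rotation error in degrees. The sketch below shows one standard way to compute both for a single prediction, assuming orientations are given as quaternions in (w, x, y, z) order; the function names are hypothetical and this is not the authors' evaluation code.

```python
import math

def translation_error(t_pred, t_true):
    """Euclidean distance between predicted and true camera positions."""
    return math.dist(t_pred, t_true)

def rotation_error_deg(q_pred, q_true):
    """Angle in degrees of the rotation separating two quaternions."""
    def unit(q):
        n = math.sqrt(sum(c * c for c in q))
        return [c / n for c in q]

    a, b = unit(q_pred), unit(q_true)
    # abs() handles the q / -q ambiguity; min() guards acos against
    # floating-point values slightly above 1.
    dot = min(1.0, abs(sum(x * y for x, y in zip(a, b))))
    return math.degrees(2.0 * math.acos(dot))

# Example: prediction is 5 units away and rotated 90 degrees about the z-axis.
half = math.radians(45.0)
print(translation_error([3.0, 4.0, 0.0], [0.0, 0.0, 0.0]))
print(rotation_error_deg([math.cos(half), 0.0, 0.0, math.sin(half)], [1.0, 0.0, 0.0, 0.0]))
```

Benchmark tables usually aggregate these per-frame errors with the median over each scene.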

Problem

Research questions and friction points this paper is trying to address.

Generalizable multi-scene camera relocalization using vision-language models

Overcoming lack of robustness in diverse environments for 6-DoF pose estimation

Enhancing semantic understanding of complex scenes via natural language guidance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages vision-language models for relocalization

Uses multimodal data for indoor and outdoor generalization

Employs natural language to guide scene learning
