MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization

📅 2025-07-06
🤖 AI Summary
Existing single-scene deep learning approaches to camera relocalization generalize poorly and lack robustness across diverse environments. To address this, we propose MVL-Loc, an end-to-end multi-scene 6-DoF pose estimation framework that integrates vision-language models (VLMs) into relocalization. Drawing on the semantic priors and cross-modal alignment of VLMs, MVL-Loc uses natural language instructions to guide feature learning, which improves semantic understanding and spatial relationship modeling in complex indoor and outdoor scenes. The method jointly fuses visual, linguistic, and geometric cues, enabling adaptive multimodal feature alignment and end-to-end pose regression. Evaluated on 7Scenes and Cambridge Landmarks, MVL-Loc achieves state-of-the-art position and orientation accuracy, outperforming all prior methods and demonstrating generalization across heterogeneous scenes and robustness for real-world deployment.

📝 Abstract
Camera relocalization, a cornerstone capability of modern computer vision, accurately determines a camera's position and orientation (6-DoF) from images and is essential for applications in augmented reality (AR), mixed reality (MR), autonomous driving, delivery drones, and robotic navigation. Unlike traditional deep learning-based methods that regress camera pose from images in a single scene, which often lack generalization and robustness in diverse environments, we propose MVL-Loc, a novel end-to-end multi-scene 6-DoF camera relocalization framework. MVL-Loc leverages pretrained world knowledge from vision-language models (VLMs) and incorporates multimodal data to generalize across both indoor and outdoor settings. Furthermore, natural language is employed as a directive tool to guide the multi-scene learning process, facilitating semantic understanding of complex scenes and capturing spatial relationships among objects. Extensive experiments on the 7Scenes and Cambridge Landmarks datasets demonstrate MVL-Loc's robustness and state-of-the-art performance in real-world multi-scene camera relocalization, with improved accuracy in both positional and orientational estimates.
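The abstract reports accuracy in both positional and orientational estimates, but the evaluation code is not part of this page. As a minimal sketch of how such 6-DoF pose errors are commonly computed, the following assumes positions are 3-vectors and orientations are quaternions in (w, x, y, z) order; the function names `translation_error` and `rotation_error_deg` are illustrative and not taken from the paper.

```python
import math


def translation_error(t_pred, t_true):
    """Euclidean distance between predicted and ground-truth positions."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(t_pred, t_true)))


def rotation_error_deg(q_pred, q_true):
    """Angle in degrees between two orientations given as quaternions.

    Assumption: quaternions are ordered (w, x, y, z). They are normalized
    here, so unnormalized network outputs are accepted.
    """
    norm_p = math.sqrt(sum(c * c for c in q_pred))
    norm_g = math.sqrt(sum(c * c for c in q_true))
    dot = sum(p * g for p, g in zip(q_pred, q_true)) / (norm_p * norm_g)
    # q and -q encode the same rotation, so use the absolute value;
    # clamp to 1.0 to guard acos against floating-point overshoot.
    dot = min(1.0, abs(dot))
    return math.degrees(2.0 * math.acos(dot))


if __name__ == "__main__":
    # Hypothetical example values: a 90-degree rotation about the z axis.
    q_identity = (1.0, 0.0, 0.0, 0.0)
    q_z90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
    print(translation_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
    print(rotation_error_deg(q_identity, q_z90))
```

Benchmarks such as 7Scenes and Cambridge Landmarks are usually summarized by the median of these two errors over all test frames of a scene.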
Problem

Research questions and friction points this paper is trying to address.

Generalizable multi-scene camera relocalization using vision-language models
Overcoming lack of robustness in diverse environments for 6-DoF pose estimation
Enhancing semantic understanding of complex scenes via natural language guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages vision-language models for relocalization
Uses multimodal data for indoor and outdoor generalization
Employs natural language to guide scene learning
Authors
Zhendong Xiao
School of Automation Science and Engineering, South China University of Technology, Guangzhou, Guangdong Province, China
Wu Wei
Southern University of Science and Technology
lithium ion battery, anode, silicon
Shujie Ji
School of Automation Science and Engineering, South China University of Technology, Guangzhou, Guangdong Province, China
Shan Yang
School of Automation Science and Engineering, South China University of Technology, Guangzhou, Guangdong Province, China
Changhao Chen
HKUST-GZ
Embodied AI, Robotics, Inertial Navigation, SLAM