Fine-Tuning Vision-Language Models for Visual Navigation Assistance

📅 2025-09-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address insufficient localization accuracy and unreliable directional instruction generation in indoor navigation for visually impaired users, this paper proposes a vision-language joint-driven navigation method. Built upon the BLIP-2 architecture, the approach incorporates Low-Rank Adaptation (LoRA) for efficient fine-tuning and introduces an enhanced BERT F1 metric integrating directional and sequential constraints. The model is trained on a manually curated indoor navigation dataset, significantly improving both accuracy and logical coherence of multi-step navigational instructions. Experimental results demonstrate that the method effectively mitigates BLIP-2’s limitations in spatial relation modeling and structured instruction generation: directional recognition accuracy increases by 18.7%, and instruction execution success rate rises by 23.4%. This yields a more robust and interpretable end-to-end navigation system tailored for visually impaired individuals.

Technology Category

Application Category

📝 Abstract
We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Traditional navigation systems are ineffective indoors due to the lack of precise location data. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence. We fine-tune the BLIP-2 model with Low Rank Adaptation (LoRA) on a manually annotated indoor navigation dataset. We propose an evaluation metric that refines the BERT F1 score by emphasizing directional and sequential variables, providing a more comprehensive measure of navigational performance. After applying LoRA, the model significantly improved in generating directional instructions, overcoming limitations in the original BLIP-2 model.
Problem

Research questions and friction points this paper is trying to address.

Assisting visually impaired individuals with indoor navigation
Overcoming lack of precise indoor location data
Generating step-by-step navigational instructions using vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned BLIP-2 model with LoRA
Generated step-by-step navigational instructions
Proposed refined BERT F1 evaluation metric
🔎 Similar Papers
X
Xiao Li
University of Florida, Gainesville, FL, USA
B
Bharat Gandhi
University of Florida, Gainesville, FL, USA
Ming Zhan
Ming Zhan
University of Florida, Gainesville, FL, USA
M
Mohit Nehra
University of Florida, Gainesville, FL, USA
Zhicheng Zhang
Zhicheng Zhang
Carnegie Mellon University
Reinforcement LearningExplainable RL
Y
Yuchen Sun
University of Florida, Gainesville, FL, USA
Meijia Song
Meijia Song
University of Minnesota
Nursing InformaticsHealth Informatics
N
Naisheng Zhang
New York University, New York, NY, USA
X
Xi Wang
University of Florida, Gainesville, FL, USA