A Navigation Framework Utilizing Vision-Language Models

📅 2025-06-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost and poor real-time deployability of large vision-language models (LVLMs) in vision-language navigation (VLN), this paper proposes a modular, decoupled navigation framework: Qwen2.5-VL-7B-Instruct is frozen as a fixed multimodal understanding backbone, while a lightweight action planning module is coupled on top. Key innovations include dual-frame visual encoding, structured trajectory memory, and customized prompt engineering to enhance cross-step decision consistency. Evaluated end-to-end on Habitat-Lab with Matterport3D under R2R and VLN-CE benchmarks, the framework achieves low-latency deployment. Compared to full fine-tuning, it reduces inference GPU memory by 62% and latency by 5.3×, while retaining competitive zero-shot generalization, though further improvement remains desirable. This work establishes a scalable, resource-efficient paradigm for VLN in compute-constrained environments.

📝 Abstract
Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI, requiring agents to interpret natural language instructions and navigate through visually rich, unfamiliar environments. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding but introduced new challenges related to computational cost and real-time deployment. In this project, we propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning. By integrating a frozen vision-language model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to achieve flexible, fast, and adaptable navigation without extensive model fine-tuning. Our framework leverages prompt engineering, structured history management, and a two-frame visual input strategy to enhance decision-making continuity across navigation steps. We evaluate our system on the Room-to-Room benchmark within the VLN-CE setting using the Matterport3D dataset and Habitat-Lab simulation environment. Although our initial results reveal challenges in generalizing to unseen environments under strict evaluation settings, our modular approach lays a foundation for scalable and efficient navigation systems, highlighting promising directions for future improvement through enhanced environmental priors and expanded multimodal input integration.
Problem

Research questions and friction points this paper is trying to address.

Interpreting natural language instructions for visual navigation.
Reducing computational costs in real-time navigation systems.
Enhancing decision-making continuity in unfamiliar environments.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular framework decouples vision-language and planning
Integrates frozen Qwen2.5-VL-7B with lightweight logic
Uses prompt engineering and two-frame visual strategy
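The decoupled loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and function names (`TrajectoryMemory`, `build_prompt`, `parse_action`), the action vocabulary, and the prompt layout are assumptions; the real system queries a frozen Qwen2.5-VL-7B-Instruct with the rendered prompt plus the two attached frames.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a frozen VLM is prompted with the instruction, the two
# most recent frames, and a structured trajectory memory; a lightweight
# planner then maps the model's free-form reply to a discrete action.

ACTIONS = ["MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]  # assumed action space


@dataclass
class TrajectoryMemory:
    """Structured history of (step, action, observation-summary) entries."""
    entries: list = field(default_factory=list)

    def add(self, step: int, action: str, summary: str) -> None:
        self.entries.append(f"step {step}: {action} -> {summary}")

    def render(self, last_k: int = 5) -> str:
        # Keep only the most recent steps to bound prompt length.
        return "\n".join(self.entries[-last_k:]) or "(no history yet)"


def build_prompt(instruction: str, memory: TrajectoryMemory,
                 has_prev_frame: bool) -> str:
    """Assemble the text part of the prompt; frames are attached separately."""
    frames = "previous and current frames" if has_prev_frame else "current frame"
    return (
        f"Instruction: {instruction}\n"
        f"Visual input: {frames} attached.\n"
        f"History:\n{memory.render()}\n"
        f"Choose exactly one action from {ACTIONS}."
    )


def parse_action(reply: str) -> str:
    """Lightweight planning logic: extract a discrete action from the reply."""
    for action in ACTIONS:
        if action in reply.upper():
            return action
    return "STOP"  # conservative fallback when the reply is unparseable
```

Keeping the VLM frozen means all adaptation happens in `build_prompt` and `parse_action`, which is what makes the framework plug-and-play: swapping the backbone or the action space requires no retraining.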
Yicheng Duan
Case Western Reserve University
Embodied AI · CV
Kaiyu Tang
Computer and Data Sciences, School of Engineering, Case Western Reserve University, Cleveland, Ohio, 44106, USA