🤖 AI Summary
To address the high computational cost and poor real-time deployability of large vision-language models (LVLMs) in Vision-and-Language Navigation (VLN), this paper proposes a modular, decoupled navigation framework: Qwen2.5-VL-7B-Instruct serves as a frozen multimodal understanding backbone, with a lightweight action-planning module coupled on top. Key components include dual-frame visual encoding, structured trajectory memory, and tailored prompt engineering to improve decision consistency across navigation steps. Evaluated end-to-end in Habitat-Lab with Matterport3D on the Room-to-Room (R2R) benchmark under the VLN-CE setting, the framework supports low-latency deployment: compared to full fine-tuning, it reduces inference GPU memory by 62% and latency by 5.3×, although zero-shot generalization to unseen environments remains a challenge. This work outlines a scalable, resource-efficient paradigm for VLN in compute-constrained environments.
📝 Abstract
Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI, requiring agents to interpret natural language instructions and navigate through visually rich, unfamiliar environments. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding but introduce new challenges in computational cost and real-time deployment. In this project, we propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning. By integrating a frozen vision-language model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to achieve flexible, fast, and adaptable navigation without extensive model fine-tuning. Our framework leverages prompt engineering, structured history management, and a two-frame visual input strategy to enhance decision-making continuity across navigation steps. We evaluate our system on the Room-to-Room benchmark in the VLN-CE setting, using the Matterport3D dataset and the Habitat-Lab simulation environment. Although our initial results reveal challenges in generalizing to unseen environments under strict evaluation settings, our modular approach lays a foundation for scalable and efficient navigation systems and points to promising directions for future work, such as enhanced environmental priors and broader multimodal input integration.
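To make the decoupled design concrete, the following is a minimal sketch (not the authors' code) of how a frozen VLM can be wrapped as a plug-and-play planner: a text prompt is composed from the instruction, a structured action history, and references to the previous and current frames, and the model's free-form reply is parsed into a discrete VLN-CE action. All function names, the prompt wording, and the action vocabulary here are illustrative assumptions.

```python
# Hypothetical sketch of the decoupled planning loop described above:
# a frozen VLM answers a per-step prompt; lightweight logic around it
# handles history and action parsing. Names and prompt format are assumed.
from dataclasses import dataclass, field

# Discrete low-level actions typical of the VLN-CE setting.
ACTIONS = ["MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]

@dataclass
class NavState:
    instruction: str
    history: list = field(default_factory=list)  # structured trajectory memory

def build_prompt(state: NavState, prev_frame_id: str, curr_frame_id: str) -> str:
    """Compose the text prompt sent alongside the two visual inputs."""
    past = "; ".join(f"step {i}: {a}" for i, a in enumerate(state.history)) or "none"
    return (
        f"Instruction: {state.instruction}\n"
        f"Actions so far: {past}\n"
        f"Images: previous view <{prev_frame_id}>, current view <{curr_frame_id}>\n"
        f"Choose one action from {ACTIONS} and reply with only that token."
    )

def parse_action(model_output: str) -> str:
    """Map the model's free-form reply to a discrete action; STOP as fallback."""
    for a in ACTIONS:
        if a in model_output.upper():
            return a
    return "STOP"

# One step of the loop, with a stubbed model reply in place of the frozen VLM.
state = NavState("Walk past the sofa and stop at the kitchen door.")
prompt = build_prompt(state, "frame_t-1", "frame_t")
action = parse_action("I would MOVE_FORWARD toward the sofa.")
state.history.append(action)  # history feeds the next step's prompt
```

Because the backbone stays frozen, swapping the planner (or the prompt format) requires no retraining, which is the core of the plug-and-play claim.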