🤖 AI Summary
To address indoor navigation in GPS-denied environments with complex building layouts, this paper proposes an infrastructure-free, end-to-end vision–language collaborative navigation system. Methodologically, a two-stage fine-tuned ResNet-50 provides robust visual localization, while the navigation module uses systematic prompt engineering to guide large language models (e.g., ChatGPT) in parsing publicly available floor plan images and generating natural-language path instructions. The core contribution is the first deep integration of a vision-based localization model with a large language model, establishing a direct mapping from input imagery to semantically grounded navigation commands. Evaluated in a real-world office corridor, the system achieves 96% localization accuracy and 75% average instruction accuracy, demonstrating its effectiveness and feasibility under practical constraints, including low-light conditions and stringent response-time requirements.
📝 Abstract
Indoor navigation remains a complex challenge due to the absence of reliable GPS signals and the architectural intricacies of large enclosed environments. This study presents an indoor localization and navigation approach that integrates vision-based localization with large language model (LLM)-based navigation. The localization system utilizes a ResNet-50 convolutional neural network fine-tuned through a two-stage process to identify the user's position using smartphone camera input. To complement localization, the navigation module employs an LLM, guided by a carefully crafted system prompt, to interpret preprocessed floor plan images and generate step-by-step directions. Experimental evaluation was conducted in a realistic office corridor with repetitive features and limited visibility to test localization robustness. The model achieved high confidence and an accuracy of 96% across all tested waypoints, even under constrained viewing conditions and short-duration queries. Navigation tests using ChatGPT on real building floor maps yielded an average instruction accuracy of 75%, with observed limitations in zero-shot reasoning and inference time. This research demonstrates the potential for scalable, infrastructure-free indoor navigation using off-the-shelf cameras and publicly available floor plans, particularly in resource-constrained settings like hospitals, airports, and educational institutions.
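The prompt-guided navigation module could be sketched as below. The paper's actual system prompt is not reproduced here, so the template, its rules, and the helper name `build_navigation_prompt` are all illustrative assumptions.

```python
# Illustrative sketch of constructing a system prompt that steers an LLM
# (e.g., ChatGPT) to read a preprocessed floor plan and emit step-by-step
# directions. The wording and fields are assumptions, not the paper's prompt.
def build_navigation_prompt(current_waypoint: str, destination: str) -> str:
    system_prompt = (
        "You are an indoor navigation assistant. You will be shown a "
        "preprocessed floor plan image of the building. Using only that "
        "image, produce numbered, step-by-step walking directions.\n"
        "Rules:\n"
        "1. Refer only to corridors, doors, and room labels visible on the plan.\n"
        "2. Give distances in approximate steps or metres.\n"
        "3. Do not invent landmarks that are not on the plan."
    )
    user_query = (
        f"I am currently at {current_waypoint}. "
        f"Guide me to {destination}."
    )
    return system_prompt + "\n\n" + user_query

prompt = build_navigation_prompt("Elevator Lobby, Floor 2", "Room 214")
print("Elevator Lobby, Floor 2" in prompt and "Room 214" in prompt)  # True
```

In a zero-shot setting like the one evaluated here, constraining the model to landmarks actually present on the floor plan is one plausible way to limit the hallucinated instructions that the reported 75% accuracy implies.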