🤖 AI Summary
Indoor navigation faces challenges including the absence of GPS signals, complex spatial layouts, and the difficulty of satisfying personalized user requirements. This paper proposes, for the first time, an end-to-end approach that uses multimodal large language models (LLMs) to generate natural-language navigation instructions directly from indoor map images, bypassing explicit mapping and path-planning modules. Our method relies on prompt engineering that jointly exercises the model's image-captioning and spatial-reasoning capabilities. We evaluate the approach on a test set of real-world indoor maps under a human-centered evaluation framework. Results show an average instruction correctness rate of 52% (up to 62%), with performance driven primarily by point-of-interest density and visual redundancy rather than topological complexity. This work introduces a novel paradigm for lightweight, interpretable, and user-adaptive indoor navigation systems.
📝 Abstract
Indoor navigation presents unique challenges due to complex layouts, the lack of GPS signals, and accessibility concerns. Existing solutions often struggle with real-time adaptability and user-specific needs. In this work, we explore the potential of a Large Language Model (LLM), namely ChatGPT, to generate natural, context-aware navigation instructions from indoor map images. We design and evaluate test cases across different real-world environments, analyzing how effectively the LLM interprets spatial layouts, handles user constraints, and plans efficient routes. Our findings demonstrate the potential of LLMs to support personalized indoor navigation, with an average of 52% correct instructions and a maximum of 62%. Performance does not appear to depend on the complexity of the layout or of the expected path, but rather on the number of points of interest and the abundance of visual information, both of which negatively affect the results.