Eyes Will Shut: A Vision-Based Next GPS Location Prediction Model by Reinforcement Learning from Visual Map Feedback

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current next-location prediction models lack human-like map reasoning capabilities, hindering effective integration of road topology and mobility trends. To address this, we propose a vision-guided trajectory reasoning framework that renders trajectories and road networks as images and leverages vision-language models (VLMs) to jointly model spatial structure and movement patterns. We introduce a two-stage training paradigm: first, supervised fine-tuning (SFT) to learn geometric constraints imposed by the road network; second, map-informed reinforcement learning for self-optimization. Crucially, we design a novel vision-guided location search evaluation mechanism, enabling end-to-end interpretable reasoning. Our method achieves state-of-the-art performance across four real-world city datasets, significantly outperforming mainstream large-model baselines while demonstrating superior cross-city generalization.

📝 Abstract
Next Location Prediction is a fundamental task in the study of human mobility, with wide-ranging applications in transportation planning, urban governance, and epidemic forecasting. In practice, when humans attempt to predict the next location in a trajectory, they often visualize the trajectory on a map and reason based on road connectivity and movement trends. However, the vast majority of existing next-location prediction models do not reason over maps in the way that humans do. Fortunately, the recent development of Vision-Language Models (VLMs) has demonstrated strong capabilities in visual perception and even visual reasoning. This opens up a new possibility: by rendering both the road network and trajectory onto an image and leveraging the reasoning abilities of VLMs, we can enable models to perform trajectory inference in a human-like manner. To explore this idea, we first propose a method called Vision-Guided Location Search (VGLS), which evaluates whether a general-purpose VLM is capable of trajectory-based reasoning without modifying any of its internal parameters. Based on insights from the VGLS results, we further propose our main approach: VLMLocPredictor, which is composed of two stages. In the first stage, we design two Supervised Fine-Tuning (SFT) tasks that help the VLM understand road network and trajectory structures and acquire basic reasoning ability on such visual inputs. In the second stage, we introduce Reinforcement Learning from Visual Map Feedback, enabling the model to self-improve its next-location prediction ability through interaction with the environment. Experiments conducted on datasets from four different cities show that our method achieves state-of-the-art (SOTA) performance and exhibits superior cross-city generalization compared to other LLM-based approaches.
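The abstract's core idea is to rasterize a GPS trajectory onto a map image that a VLM can consume. A minimal sketch of that rendering step is below; the grid size, coordinate bounds, and visit-order encoding are illustrative assumptions, not details taken from the paper:

```python
def rasterize_trajectory(points, bounds, size=32):
    """Render GPS points onto a size x size grid.

    points: list of (lon, lat) tuples in visit order
    bounds: (min_lon, min_lat, max_lon, max_lat) of the map tile
    Returns a 2D grid where 0 = background and 1..n = visit order,
    so the model can read both position and direction of travel.
    """
    min_lon, min_lat, max_lon, max_lat = bounds
    grid = [[0] * size for _ in range(size)]
    for order, (lon, lat) in enumerate(points, start=1):
        # Normalize coordinates into pixel indices (y flipped: north = top row).
        x = int((lon - min_lon) / (max_lon - min_lon) * (size - 1))
        y = int((max_lat - lat) / (max_lat - min_lat) * (size - 1))
        grid[y][x] = order  # later visits overwrite earlier ones
    return grid

# Example: three points moving north-east across a unit tile.
img = rasterize_trajectory([(0.1, 0.1), (0.5, 0.5), (0.9, 0.9)], (0, 0, 1, 1), size=4)
```

In the paper's pipeline the rendered image would also include the road network layer; here only the trajectory channel is sketched.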
Problem

Research questions and friction points this paper is trying to address.

Predict next GPS location using visual map feedback
Enhance trajectory reasoning with Vision-Language Models
Improve cross-city generalization in location prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision-Language Models for trajectory prediction
Implements Vision-Guided Location Search method
Applies Reinforcement Learning from visual feedback
Ruixing Zhang
State Key Laboratory of Complex and Critical Software Environment, Beihang University
Yang Zhang
State Key Laboratory of Complex and Critical Software Environment, Beihang University
Tongyu Zhu
State Key Laboratory of Complex and Critical Software Environment, Beihang University
Leilei Sun
Beihang University
Data Mining · Machine Learning · Graph Learning
Weifeng Lv
State Key Laboratory of Complex and Critical Software Environment, Beihang University