Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

📅 2026-06-08
🤖 AI Summary
This work addresses pedestrian crossing intention prediction from short egocentric (first-person) video clips by formulating it as a closed-ended visual question answering problem and solving it with vision-language models (VLMs). In a zero-shot setting, state-of-the-art VLMs only moderately outperform random guessing; parameter-efficient fine-tuning raises accuracy to 9% above a specialized transformer-based baseline. Adding contextual cues (ego motion, vehicle motion, and eye gaze) improves performance further: the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.
📝 Abstract
Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remains underexplored. In this work, we study the decoding of pedestrian crossing intentions from short egocentric video clips. We approach this by formulating the task as a closed-ended visual question answering (VQA) problem and leveraging vision language models (VLMs) to predict the pedestrians' intent. We first benchmark three families of state-of-the-art VLMs in a zero-shot setting, finding that they achieve moderate gains over random guessing but exhibit limited higher-level traffic reasoning. Motivated by these findings, we further adapt VLMs to the target task using parameter-efficient fine-tuning. Our results show that the fine-tuned models substantially outperform their zero-shot counterparts and achieve a 9% accuracy improvement over a specialized transformer-based baseline. Finally, we demonstrate that incorporating additional contextual cues, including ego motion, vehicle motion, and eye gaze, further improves predictive performance. In particular, the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.
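As a rough illustration of the closed-ended VQA formulation described in the abstract, the sketch below builds a two-option text prompt, optionally adds contextual cues as plain-text hints, and maps a model's free-text reply back to a label. The prompt wording, the option labels, the way cues are serialized, and every name in the code (ClipContext, build_prompt, parse_answer, accuracy) are assumptions made for this example; the abstract does not specify any of them, and the call to the VLM itself is left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical answer options; the paper's actual wording is not given.
CHOICES = {"A": "crossing", "B": "not crossing"}


@dataclass
class ClipContext:
    """Optional contextual cues, each serialized as a short text hint (assumed format)."""
    ego_motion: Optional[str] = None
    vehicle_motion: Optional[str] = None
    eye_gaze: Optional[str] = None


def build_prompt(ctx: ClipContext) -> str:
    """Build a closed-ended question to accompany a short egocentric video clip."""
    lines = ["You are given a short egocentric video clip."]
    # Only cues that are actually provided are added to the prompt.
    if ctx.ego_motion:
        lines.append(f"Ego motion: {ctx.ego_motion}")
    if ctx.vehicle_motion:
        lines.append(f"Vehicle motion: {ctx.vehicle_motion}")
    if ctx.eye_gaze:
        lines.append(f"Eye gaze: {ctx.eye_gaze}")
    lines.append("Question: Will the pedestrian cross the road?")
    for letter, label in CHOICES.items():
        lines.append(f"{letter}. {label}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_answer(reply: str) -> Optional[str]:
    """Map a reply such as 'A' or 'b.' to a label; return None if it is not a valid option."""
    cleaned = reply.strip().upper()
    if not cleaned or cleaned[0] not in CHOICES:
        return None
    # Reject ordinary words that merely start with an option letter (e.g. 'Answer').
    if len(cleaned) > 1 and cleaned[1].isalpha():
        return None
    return CHOICES[cleaned[0]]


def accuracy(preds: Sequence[Optional[str]], labels: Sequence[str]) -> float:
    """Fraction of predictions equal to the ground-truth label (unparseable replies count as wrong)."""
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)
```

Restricting the output to a fixed set of options makes scoring a simple exact-match accuracy, which is what allows a direct comparison against a classification baseline. In this sketch a reply that cannot be parsed is counted as incorrect; other conventions are possible.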
Problem

Research questions and friction points this paper is trying to address.

pedestrian crossing intention
egocentric vision
traffic safety prediction
visual question answering
vision language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision language models
egocentric vision
pedestrian intention decoding
parameter-efficient fine-tuning
visual question answering