Pedestrian Crossing Intent Prediction via Psychological Features and Transformer Fusion

2026-03-19
Citations: 0
Influential: 0
AI Summary
This work addresses pedestrian crossing intent prediction for urban autonomous driving with a lightweight, risk-aware framework that integrates four behavioral cues: attention, position, situation, and interaction. It combines psychological behavioral features with a compact 4-token Transformer, using highway encoders, global self-attention pooling, and a variational bottleneck to fuse the feature streams. The method further incorporates Mahalanobis distance-based distribution shift detection and uncertainty quantification to produce calibrated probabilities and risk scores. Evaluated on PSI 1.0, the model achieves 0.90 F1, 0.94 AUC-ROC, and 0.78 MCC; on PSI 2.0, it establishes an initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores improves test accuracy by up to 0.4 percentage points at 80% coverage. The approach is modality-agnostic, interpretable, and efficient to deploy.
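As a rough illustration of the variational bottleneck term mentioned in the summary, the sketch below computes the KL divergence between a diagonal Gaussian posterior and a standard normal prior. The summary does not state which posterior or prior family the model uses, so both are assumptions here, and the function name `kl_to_standard_normal` is hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector.

    Assumption: diagonal Gaussian posterior and standard normal prior.
    `mu` and `log_var` are equal-length sequences of floats.
    """
    if len(mu) != len(log_var):
        raise ValueError("mu and log_var must have the same length")
    # Closed form, summed over latent dimensions:
    # 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )
```

Under these assumptions the term is zero when the posterior equals the prior and grows as the encoded sample moves away from it, which is why it can serve as an uncertainty signal.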

Abstract
Accurate pedestrian intention prediction is essential for autonomous vehicles to navigate safely in urban environments. We present a lightweight, socially informed architecture for pedestrian intention prediction. It fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. To quantify uncertainty, we incorporate two complementary heads: a variational bottleneck whose KL divergence captures epistemic uncertainty, and a Mahalanobis distance detector that identifies distributional shift. Together, these components yield calibrated probabilities and actionable risk scores without compromising efficiency. On the PSI 1.0 benchmark, our model outperforms recent vision-language models, achieving 0.90 F1, 0.94 AUC-ROC, and 0.78 MCC while using only structured, interpretable features. On the more diverse PSI 2.0 dataset, where, to the best of our knowledge, no prior results exist, we establish a strong initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage. Qualitative attention heatmaps further show how the model shifts its cross-stream focus under ambiguity. The proposed approach is modality-agnostic, easy to integrate with vision-language pipelines, and suitable for risk-aware intent prediction on resource-constrained platforms.
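The selective prediction step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single Gaussian fitted to training features (the abstract does not say whether class-conditional statistics are used), and the function names, the `eps` regularizer, and the quantile-based threshold are all hypothetical choices.

```python
import numpy as np

def fit_gaussian(features, eps=1e-6):
    """Estimate the mean and a regularized inverse covariance of training features."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    # Assumption: a small ridge term keeps the covariance invertible.
    cov = cov + eps * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def mahalanobis_scores(features, mean, cov_inv):
    """Mahalanobis distance of each row of `features` to the fitted Gaussian."""
    diff = np.asarray(features, dtype=float) - mean
    squared = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(squared, 0.0))

def select_by_coverage(scores, coverage=0.8):
    """Boolean mask keeping roughly the `coverage` fraction with the lowest scores."""
    threshold = np.quantile(scores, coverage)
    return scores <= threshold
```

With `coverage=0.8`, the model would answer only for the 80% of samples closest to the training distribution and abstain on the rest; accuracy is then measured on the retained subset.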
Problem

Research questions and friction points this paper is trying to address.

Pedestrian Crossing Intent Prediction
Autonomous Vehicles
Urban Environments
Intention Prediction
Risk-aware Prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer Fusion
Uncertainty Quantification
Pedestrian Intention Prediction
Structured Behavioral Features
Mahalanobis Distance
Sima Ashayer
The University of Tennessee at Chattanooga, Chattanooga, TN, USA
Hoang H. Nguyen
Research Assistant Professor at University of Tennessee at Chattanooga
Graph Learning, Machine Learning, Blockchain Security, Smart Transportation
Yu Liang
The University of Tennessee at Chattanooga, Chattanooga, TN, USA
Mina Sartipi
The University of Tennessee at Chattanooga, Chattanooga, TN, USA