AI Summary
This work proposes a novel multi-scale Transformer architecture that integrates multimodal inputs, such as visual and motion cues, to enhance the accuracy of pedestrian crossing intention prediction for Level 3-4 autonomous driving. For the first time, a video Vision Transformer (ViT) is introduced to this task, leveraging its capacity to capture long-range spatiotemporal dependencies. The effectiveness of the proposed design is rigorously validated through systematic ablation studies. Evaluated on the JAAD dataset, the model achieves state-of-the-art performance, significantly outperforming existing methods across key metrics including Accuracy, AUC, and F1-score. These results demonstrate the potential of the approach to improve pedestrian interaction safety in high-level autonomous driving systems.
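The summary does not spell out how the video ViT tokenizes clips or how modalities are combined, so the following is only a minimal sketch of the general pattern such models use: a clip is split into spatiotemporal "tubelet" tokens, each modality is projected to a shared width, and the token sequences are concatenated before the Transformer. All function names, dimensions, and the early-fusion choice here are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def video_to_tubelets(clip, patch=4, tube=2):
    """Split a video clip (T, H, W, C) into flattened spatiotemporal
    'tubelet' tokens, the tokenization commonly used by video ViTs."""
    T, H, W, C = clip.shape
    assert T % tube == 0 and H % patch == 0 and W % patch == 0
    tokens = []
    for t in range(0, T, tube):
        for y in range(0, H, patch):
            for x in range(0, W, patch):
                tokens.append(clip[t:t + tube, y:y + patch, x:x + patch].reshape(-1))
    return np.stack(tokens)  # (num_tokens, tube * patch * patch * C)

def fuse_modalities(video_tokens, motion_feats, d_model=64, seed=0):
    """Project each modality to a shared embedding width and concatenate
    along the token axis -- one simple early-fusion strategy (assumed here)."""
    rng = np.random.default_rng(seed)
    W_v = rng.standard_normal((video_tokens.shape[1], d_model)) * 0.02
    W_m = rng.standard_normal((motion_feats.shape[1], d_model)) * 0.02
    return np.concatenate([video_tokens @ W_v, motion_feats @ W_m], axis=0)

# Toy inputs: an 8-frame 16x16 RGB crop and per-frame motion cues
# (e.g. a pedestrian bounding box as x, y, w, h -- hypothetical features).
clip = np.random.default_rng(1).random((8, 16, 16, 3))
motion = np.random.default_rng(2).random((8, 4))
tokens = fuse_modalities(video_to_tubelets(clip), motion)
print(tokens.shape)  # 4*4*4 video tokens + 8 motion tokens -> (72, 64)
```

The fused token sequence would then be fed to a standard Transformer encoder; varying the encoder depth and width is one way to obtain the "different sizes" the abstract mentions.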
Abstract
Pedestrian intention prediction is one of the key technologies in the transition from Level 3 to Level 4 autonomous driving. To understand pedestrian crossing behaviour, several elements and features must be taken into consideration to make the roads of tomorrow safer for everybody. We introduce Transformer- and video Vision Transformer-based models of different sizes that use different data modalities. We evaluated our models on the popular pedestrian behaviour dataset JAAD, reaching state-of-the-art performance and surpassing prior results on metrics such as Accuracy, AUC, and F1-score. The advantages brought by different model design choices are investigated via extensive ablation studies.
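For reference, the three metrics reported above can be computed from binary crossing labels and predicted crossing probabilities as sketched below (a self-contained numpy version with the usual definitions; the paper's exact evaluation thresholds are not specified here, so the 0.5 decision threshold is an assumption).

```python
import numpy as np

def accuracy(y_true, y_prob, thr=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    return float(np.mean((y_prob >= thr) == y_true))

def f1_score(y_true, y_prob, thr=0.5):
    """Harmonic mean of precision and recall for the positive (crossing) class."""
    pred = y_prob >= thr
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def auc(y_true, y_prob):
    """ROC AUC as the probability that a random positive is ranked above
    a random negative (ties count one half)."""
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))

# Toy example: 5 pedestrians, 1 = will cross.
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4, 0.7])
print(accuracy(y, p), round(auc(y, p), 3), round(f1_score(y, p), 3))
# -> 0.6 0.667 0.667
```

Accuracy and F1 depend on the decision threshold, while AUC is threshold-free, which is why papers on imbalanced crossing/not-crossing data typically report all three.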