🤖 AI Summary
To address insufficient pedestrian trajectory prediction accuracy and weak environmental constraint modeling in autonomous driving, this paper proposes a multi-source dynamic fusion method that integrates fine-grained scene semantics with sparse social interaction. First, a sparse graph neural network models spatiotemporal interactions among pedestrians. Second, highly discriminative scene features are extracted via Mask2Former-based semantic segmentation and image augmentation. Third, a cross-attention mechanism adaptively aligns and fuses the interaction and scene representations. Finally, a temporal convolutional network generates the trajectory predictions. This work is the first to systematically unify sparse graph-based interaction modeling with pixel-level semantic understanding, introducing a cross-attention-driven dynamic feature fusion paradigm that overcomes conventional methods' neglect of physical environmental constraints. Evaluated on standard benchmarks, the method achieves an average displacement error (ADE) of 0.252 m and a final displacement error (FDE) of 0.372 m, significantly surpassing state-of-the-art methods and validating the effectiveness of jointly modeling scene context and social interaction.
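The fusion step above can be sketched as scaled dot-product cross-attention in which per-pedestrian interaction features act as queries over scene features. This is an illustrative NumPy sketch, not the paper's implementation: the feature dimensions, random weight initialization, and function names are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(interaction_feats, scene_feats, d_k=32, seed=0):
    """Fuse features via cross-attention (hypothetical shapes).

    interaction_feats: (num_pedestrians, d_i) -- queries
    scene_feats:       (num_regions, d_s)     -- keys/values
    Returns fused features of shape (num_pedestrians, d_k).
    """
    rng = np.random.default_rng(seed)
    d_i = interaction_feats.shape[-1]
    d_s = scene_feats.shape[-1]
    # Random projection weights stand in for learned parameters.
    W_q = rng.standard_normal((d_i, d_k)) / np.sqrt(d_i)
    W_k = rng.standard_normal((d_s, d_k)) / np.sqrt(d_s)
    W_v = rng.standard_normal((d_s, d_k)) / np.sqrt(d_s)
    Q = interaction_feats @ W_q
    K = scene_feats @ W_k
    V = scene_feats @ W_v
    # Each pedestrian attends over all scene regions.
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return attn @ V
```

Because the interaction features supply the queries, the attention weights let each pedestrian emphasize the scene regions most relevant to its own motion, which is the adaptive alignment the summary describes.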
📝 Abstract
Accurate prediction of pedestrian trajectories is crucial for enhancing the safety of autonomous vehicles and reducing traffic fatalities involving pedestrians. While numerous studies have focused on modeling interactions among pedestrians to forecast their movements, the influence of environmental factors and scene-object placements has been comparatively underexplored. In this paper, we present a novel trajectory prediction model that integrates both pedestrian interactions and environmental context to improve prediction accuracy. Our approach captures spatial and temporal interactions among pedestrians within a sparse graph framework. To account for pedestrian-scene interactions, we employ advanced image enhancement and semantic segmentation techniques to extract detailed scene features. These scene and interaction features are then fused through a cross-attention mechanism, enabling the model to prioritize relevant environmental factors that influence pedestrian movements. Finally, a temporal convolutional network processes the fused features to predict future pedestrian trajectories. Experimental results demonstrate that our method significantly outperforms existing state-of-the-art approaches, achieving ADE and FDE values of 0.252 and 0.372 meters, respectively, underscoring the importance of incorporating both social interactions and environmental context in pedestrian trajectory prediction.
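The temporal convolutional network used for the final prediction stage is built on causal (optionally dilated) 1D convolutions, which ensure each output step depends only on past observations. A minimal sketch, with hypothetical shapes and untrained random weights, assuming a simple ReLU stack and a linear head producing per-step (x, y) offsets:

```python
import numpy as np

def causal_conv1d(x, w, dilation=1):
    """Causal dilated 1D convolution.

    x: (T, C_in) sequence, w: (k, C_in, C_out) filter taps
    (w[-1] multiplies the current step). Output: (T, C_out),
    where out[t] depends only on x[:t+1].
    """
    k = w.shape[0]
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)
    out = np.zeros((x.shape[0], w.shape[2]))
    for t in range(x.shape[0]):
        for i in range(k):
            # Tap i reaches i * dilation steps into the past.
            out[t] += xp[t + pad - i * dilation] @ w[k - 1 - i]
    return out

def tcn_predict(x, layers, head):
    """Stack of causal conv + ReLU layers, then a linear head.

    layers: list of (w, dilation) pairs; head: (C_hidden, 2)
    mapping hidden features to (x, y) displacement per step.
    """
    h = x
    for w, d in layers:
        h = np.maximum(causal_conv1d(h, w, d), 0.0)
    return h @ head
```

Stacking layers with increasing dilation (e.g. 1, 2, 4) grows the receptive field exponentially, letting the network summarize a long observation window without recurrence.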