🤖 AI Summary
Traditional approaches decouple 3D object detection from trajectory forecasting, hindering effective spatiotemporal dependency modeling and allowing errors to accumulate. This paper addresses vision-based 3D perception for autonomous driving with ForeSight, a streaming multi-task framework that performs detection and forecasting jointly. A forecast-aware detection transformer explicitly incorporates trajectory priors, drawn from a multi-hypothesis forecast memory queue, into the detection process, while a streaming forecast transformer enables end-to-end, tracking-free temporal modeling via query memory shared across frames. Evaluated on nuScenes, the method achieves 54.9% EPA, a 9.3% absolute improvement over prior methods, along with state-of-the-art mAP and minADE, significantly enhancing dynamic scene understanding.
📝 Abstract
We introduce ForeSight, a novel joint detection and forecasting framework for vision-based 3D perception in autonomous vehicles. Traditional approaches treat detection and forecasting as separate sequential tasks, limiting their ability to leverage temporal cues. ForeSight addresses this limitation with a multi-task streaming and bidirectional learning approach, allowing detection and forecasting to share query memory and propagate information seamlessly. The forecast-aware detection transformer enhances spatial reasoning by integrating trajectory predictions from a multiple-hypothesis forecast memory queue, while the streaming forecast transformer improves temporal consistency using past forecasts and refined detections. Unlike tracking-based methods, ForeSight eliminates the need for explicit object association, reducing error propagation with a tracking-free model that efficiently scales across multi-frame sequences. Experiments on the nuScenes dataset show that ForeSight achieves state-of-the-art performance, with an EPA of 54.9%, surpassing previous methods by 9.3%, while also attaining the best mAP and minADE among multi-view detection and forecasting models.
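To make the memory mechanism described above concrete, here is a minimal, hypothetical sketch of a multi-hypothesis forecast memory queue: per-frame, per-query trajectory hypotheses are pruned to the top K by score and retained over a sliding window, so a later detection step can retrieve them as priors. The class name, data layout, and parameters are illustrative assumptions, not the paper's actual implementation.

```python
from collections import deque

class ForecastMemoryQueue:
    """Hypothetical sketch of a multi-hypothesis forecast memory queue.

    Stores, for each recent frame, up to K trajectory hypotheses per
    object query so later detection steps can condition on forecast priors.
    """

    def __init__(self, max_frames=4, num_hypotheses=3):
        self.num_hypotheses = num_hypotheses
        self.frames = deque(maxlen=max_frames)  # oldest frame drops off automatically

    def push(self, forecasts):
        """forecasts: dict mapping query id -> list of (score, trajectory)."""
        # Keep only the top-K hypotheses per query, highest score first.
        pruned = {
            qid: sorted(hyps, key=lambda h: h[0], reverse=True)[: self.num_hypotheses]
            for qid, hyps in forecasts.items()
        }
        self.frames.append(pruned)

    def priors_for(self, qid):
        """Collect all stored hypotheses for one query, newest frame first,
        for use as trajectory priors in forecast-aware detection."""
        out = []
        for frame in reversed(self.frames):
            out.extend(frame.get(qid, []))
        return out
```

The sliding window (`deque(maxlen=...)`) bounds memory regardless of sequence length, which matches the streaming setting: each new frame pushes fresh forecasts and evicts the oldest, so retrieval cost stays constant across long drives.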