🤖 AI Summary
This work proposes an end-to-end multimodal perception and trajectory prediction framework that addresses the limitations of existing modular autonomous driving systems: constrained information flow, error accumulation, and insufficient fusion of camera and LiDAR modalities in query space. By jointly optimizing detection, tracking, and multi-hypothesis trajectory prediction, the framework outputs behavior predictions directly from raw sensor inputs. Its core innovation is the Query-Gated Deformable Fusion (QGDF) mechanism, which aggregates image features via masked cross-view attention, extracts LiDAR context through differentiable BEV sampling with learnable per-query offsets, and dynamically weights visual and geometric cues with query-conditioned gating, yielding adaptive, fully differentiable multimodal fusion. Evaluated on nuScenes, the method achieves an EPA of 0.335, mAP of 0.502, a false-positive ratio of 0.147, and an inference time of 139.82 ms, outperforming current approaches.
📝 Abstract
End-to-end perception and trajectory prediction from raw sensor data is a key capability for autonomous driving. Modular pipelines restrict information flow and can amplify upstream errors. Recent query-based, fully differentiable perception-and-prediction (PnP) models mitigate these issues, yet the complementarity of cameras and LiDAR in query space has not been sufficiently explored. Models often rely on fusion schemes with heuristic alignment and discrete selection steps, which prevent full use of the available information and can introduce unwanted bias. We propose Li-ViP3D++, a query-based multimodal PnP framework that introduces Query-Gated Deformable Fusion (QGDF) to integrate multi-view RGB and LiDAR in query space. QGDF (i) aggregates image evidence via masked attention across cameras and feature levels, (ii) extracts LiDAR context through fully differentiable BEV sampling with learned per-query offsets, and (iii) applies query-conditioned gating to adaptively weight visual and geometric cues per agent. The resulting architecture jointly optimizes detection, tracking, and multi-hypothesis trajectory forecasting in a single end-to-end model. On nuScenes, Li-ViP3D++ improves end-to-end behavior and detection quality, achieving higher EPA (0.335) and mAP (0.502) while substantially reducing false positives (FP ratio 0.147), and it is faster than the prior Li-ViP3D variant (139.82 ms vs. 145.91 ms). These results indicate that query-space, fully differentiable camera-LiDAR fusion can increase the robustness of end-to-end PnP without sacrificing deployability.
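To make step (iii) concrete, the sketch below illustrates query-conditioned gating in plain NumPy: each agent query predicts a scalar gate in (0, 1) that blends its camera-derived feature with its LiDAR-derived feature. This is a minimal illustration of the general technique, not the paper's implementation; all names (`query_gated_fusion`, `w_gate`), the toy dimensions, and the use of a single scalar gate per query are assumptions, and the camera/LiDAR inputs here are random stand-ins for the outputs of steps (i) and (ii).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def query_gated_fusion(queries, feat_cam, feat_lidar, w_gate, b_gate):
    """Hypothetical sketch of query-conditioned gating (QGDF step iii):
    each query predicts a gate that convexly blends visual and geometric cues."""
    g = sigmoid(queries @ w_gate + b_gate)           # (N, 1), one gate per query
    return g * feat_cam + (1.0 - g) * feat_lidar     # (N, D) fused per-agent features

N, D = 4, 8                                # toy sizes: 4 agent queries, 8-dim features
queries    = rng.standard_normal((N, D))   # per-agent query embeddings
feat_cam   = rng.standard_normal((N, D))   # stand-in for masked cross-view attention output
feat_lidar = rng.standard_normal((N, D))   # stand-in for deformable BEV sampling output
w_gate     = rng.standard_normal((D, 1))   # gating projection (learned in the real model)

fused = query_gated_fusion(queries, feat_cam, feat_lidar, w_gate, 0.0)
print(fused.shape)  # (4, 8)
```

Because the gate is a convex weight, every fused value stays between its camera and LiDAR counterparts, and the whole operation is differentiable, so gradients flow to both modality branches and to the gating projection during end-to-end training.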