Li-ViP3D++: Query-Gated Deformable Camera-LiDAR Fusion for End-to-End Perception and Trajectory Prediction

📅 2026-01-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes an end-to-end multimodal perception and trajectory prediction framework that addresses the limitations of existing modular autonomous driving systems: constrained information flow, error accumulation, and insufficient fusion of camera and LiDAR modalities in query space. By jointly optimizing detection, tracking, and multi-hypothesis trajectory prediction, the framework outputs behavior predictions directly from raw sensor inputs. Its core innovation is the Query-Gated Deformable Fusion (QGDF) mechanism, which aggregates image features via masked cross-view attention, extracts LiDAR context through differentiable BEV sampling with learnable per-query offsets, and dynamically weights visual and geometric cues with query-conditioned gating for adaptive, fully differentiable multimodal fusion. Evaluated on nuScenes, the method achieves an EPA of 0.335, an mAP of 0.502, a false-positive ratio of 0.147, and an inference time of 139.82 ms, outperforming current approaches.

📝 Abstract
End-to-end perception and trajectory prediction from raw sensor data is one of the key capabilities for autonomous driving. Modular pipelines restrict information flow and can amplify upstream errors. Recent query-based, fully differentiable perception-and-prediction (PnP) models mitigate these issues, yet the complementarity of cameras and LiDAR in query space has not been sufficiently explored. Models often rely on fusion schemes that introduce heuristic alignment and discrete selection steps, which prevent full utilization of the available information and can introduce unwanted bias. We propose Li-ViP3D++, a query-based multimodal PnP framework that introduces Query-Gated Deformable Fusion (QGDF) to integrate multi-view RGB and LiDAR in query space. QGDF (i) aggregates image evidence via masked attention across cameras and feature levels, (ii) extracts LiDAR context through fully differentiable BEV sampling with learned per-query offsets, and (iii) applies query-conditioned gating to adaptively weight visual and geometric cues per agent. The resulting architecture jointly optimizes detection, tracking, and multi-hypothesis trajectory forecasting in a single end-to-end model. On nuScenes, Li-ViP3D++ improves end-to-end behavior and detection quality, achieving higher EPA (0.335) and mAP (0.502) while substantially reducing false positives (FP ratio 0.147), and it is faster than the prior Li-ViP3D variant (139.82 ms vs. 145.91 ms). These results indicate that query-space, fully differentiable camera-LiDAR fusion can increase the robustness of end-to-end PnP without sacrificing deployability.
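The three QGDF steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: all shapes, parameter matrices (`W_off`, `W_gate`), and the single-point mean over sampled BEV features are illustrative assumptions; the image branch is reduced to a pre-pooled per-query feature, and the masked cross-view attention of step (i) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 4, 8          # number of agent queries, feature dim (illustrative)
H, W = 16, 16        # BEV grid size (illustrative)
K = 4                # sampling points per query

queries  = rng.standard_normal((N, D))     # per-agent query embeddings
cam_feat = rng.standard_normal((N, D))     # image evidence, assumed pre-pooled per query
bev      = rng.standard_normal((H, W, D))  # LiDAR BEV feature map

# Hypothetical learned parameters (random here, trained end-to-end in the paper).
W_off  = rng.standard_normal((D, 2 * K)) * 0.1  # query -> K (dx, dy) offsets
W_gate = rng.standard_normal((D, D)) * 0.1      # query -> per-channel gate logits

def bilinear_sample(grid, x, y):
    """Differentiable bilinear lookup of grid[H, W, D] at a continuous (x, y)."""
    x, y = np.clip(x, 0, W - 1), np.clip(y, 0, H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wy) * ((1 - wx) * grid[y0, x0] + wx * grid[y0, x1])
            + wy * ((1 - wx) * grid[y1, x0] + wx * grid[y1, x1]))

def qgdf_step(q, cam, ref_xy):
    """One fusion step for a single query: deformable BEV sampling + gating."""
    offsets = (q @ W_off).reshape(K, 2)        # (ii) learned per-query offsets
    pts = ref_xy + offsets                     # K deformable sampling locations
    lidar = np.mean([bilinear_sample(bev, px, py) for px, py in pts], axis=0)
    gate = 1.0 / (1.0 + np.exp(-(q @ W_gate)))  # (iii) query-conditioned sigmoid gate
    return gate * cam + (1.0 - gate) * lidar, gate  # convex per-channel blend

refs = rng.uniform(0, [W - 1, H - 1], size=(N, 2))  # BEV reference point per query
fused, gates = zip(*(qgdf_step(queries[i], cam_feat[i], refs[i]) for i in range(N)))
fused = np.stack(fused)  # (N, D) fused agent features
```

Because every step (offset regression, bilinear sampling, sigmoid gating) is a smooth function of the query, gradients flow from the prediction loss back through both modalities, which is the property the abstract contrasts with heuristic alignment and discrete selection.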
Problem

Research questions and friction points this paper is trying to address.

camera-LiDAR fusion
end-to-end perception
trajectory prediction
query space
multimodal fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Query-Gated Deformable Fusion
end-to-end perception and prediction
camera-LiDAR fusion
differentiable BEV sampling
query-based multimodal learning
Matej Halinkovic
Slovak University of Technology
Computer Vision · Deep Learning
Nina Masarykova
Slovak University of Technology, Ilkovičova 2, 842 16 Bratislava, Slovakia
Alexey V. Vinel
Karlsruhe Institute of Technology, Kaiserstraße 89, 76133 Karlsruhe, Germany; Halmstad University, Kristian IV:s väg 3, 301 18 Halmstad, Sweden
M. Galinski
Slovak University of Technology, Ilkovičova 2, 842 16 Bratislava, Slovakia