LiDAR-based End-to-end Temporal Perception for Vehicle-Infrastructure Cooperation

📅 2024-11-22
🏛️ IEEE Internet of Things Journal
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inaccurate temporal perception in vehicle-infrastructure cooperative sensing, caused by occlusions, blind spots, and LiDAR calibration errors, this paper proposes LET-VIC, an end-to-end LiDAR-based temporal collaborative perception framework. Methodologically, it introduces a Vehicle-Infrastructure Cooperation (VIC) cross-attention mechanism that jointly aligns spatiotemporal features across heterogeneous vehicle- and infrastructure-mounted LiDAR views, together with a learnable Calibration Error Compensation (CEC) module that corrects calibration errors automatically and supports end-to-end joint training. Evaluated on the V2X-Seq-SPD benchmark, LET-VIC achieves a +15.0% mAP and +17.3% AMOTA improvement over the baseline LET-V, and outperforms representative tracking-by-detection methods, including V2VNet, by at least +13.7% mAP and +13.1% AMOTA. These gains demonstrate substantially more robust detection and tracking in complex urban driving scenarios with severe occlusion and sensor misalignment.

📝 Abstract
Temporal perception, defined as the capability to detect and track objects across temporal sequences, serves as a fundamental component in autonomous driving systems. While single-vehicle perception systems encounter limitations stemming from incomplete perception due to object occlusion and inherent blind spots, cooperative perception systems present their own challenges in terms of sensor calibration precision and positioning accuracy. To address these issues, we introduce LET-VIC, a LiDAR-based End-to-End Tracking framework for Vehicle-Infrastructure Cooperation (VIC). First, we employ Temporal Self-Attention and VIC Cross-Attention modules to effectively integrate temporal and spatial information from both vehicle and infrastructure perspectives. Then, we develop a novel Calibration Error Compensation (CEC) module to mitigate sensor misalignment issues and facilitate accurate feature alignment. Experiments on the V2X-Seq-SPD dataset demonstrate that LET-VIC significantly outperforms baseline models. Compared to LET-V, LET-VIC achieves a +15.0% improvement in mAP and a +17.3% improvement in AMOTA. Furthermore, LET-VIC surpasses representative Tracking-by-Detection models, including V2VNet, FFNet, and PointPillars, with at least a +13.7% improvement in mAP and a +13.1% improvement in AMOTA without considering communication delays, showcasing its robust detection and tracking performance. The experiments demonstrate that integrating multi-view perspectives, temporal sequences, and CEC in end-to-end training significantly improves both detection and tracking performance. All code will be open-sourced.
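The VIC Cross-Attention fusion described in the abstract can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the query/key/value projections are identities here (the real model learns projection weights over BEV features), and all function and variable names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vic_cross_attention(veh_feats, infra_feats, d_k):
    """Fuse infrastructure BEV features into the vehicle view.

    veh_feats:   (N, d) flattened vehicle BEV tokens, used as queries.
    infra_feats: (M, d) flattened infrastructure BEV tokens, used as keys/values.
    Identity projections stand in for the learned W_q, W_k, W_v.
    """
    Q, K, V = veh_feats, infra_feats, infra_feats
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (N, M) attention weights
    return attn @ V  # (N, d) infrastructure-enhanced vehicle features

rng = np.random.default_rng(0)
veh = rng.standard_normal((4, 8))    # 4 vehicle BEV tokens, 8-dim
infra = rng.standard_normal((6, 8))  # 6 infrastructure BEV tokens
fused = vic_cross_attention(veh, infra, d_k=8)
print(fused.shape)  # (4, 8)
```

Each vehicle-view token attends over all infrastructure-view tokens, which is how occluded regions in the vehicle's field of view can be filled in from the roadside sensor.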
Problem

Research questions and friction points this paper is trying to address.

Enhancing temporal object detection in autonomous driving
Addressing sensor calibration errors in cooperative perception
Improving multi-view data fusion for tracking accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

LiDAR-based end-to-end tracking framework
Temporal and VIC Cross-Attention modules
Calibration Error Compensation module
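The idea behind the Calibration Error Compensation module can be illustrated with a toy example: shift the infrastructure BEV feature map by a predicted offset before fusing it with the vehicle view. This sketch uses integer, hand-set offsets and nearest-cell shifting; the paper's CEC module instead learns the compensation end-to-end (names and details here are assumptions).

```python
import numpy as np

def compensate_offset(bev, dx, dy):
    """Shift an infrastructure BEV feature map by a predicted (dx, dy) offset.

    bev: (H, W, C) feature map; dx, dy: integer cell offsets for this sketch.
    A learned module would predict continuous offsets and sample bilinearly.
    """
    return np.roll(bev, shift=(dy, dx), axis=(0, 1))

bev = np.zeros((4, 4, 1))
bev[1, 1, 0] = 1.0            # a single activated cell, misaligned by one cell
aligned = compensate_offset(bev, dx=1, dy=0)
print(int(aligned[1, 2, 0]))  # 1  (activation moved one cell right)
```

Without such compensation, even a one-cell calibration error would fuse the infrastructure activation into the wrong vehicle-view location, degrading detection and tracking.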
Zhenwei Yang
Erasmus Medical Center Rotterdam
Statistics · Joint models · Longitudinal analysis · Microsimulation models · Neural networks
Jilei Mao
Wen-Yen Yang
Institute for AI Industry Research (AIR), Tsinghua University, Beijing 100083, China
Yibo Ai
National Center for Materials Service Safety (NCMS), University of Science and Technology Beijing, Beijing 100083, China
Yu Kong
Michigan State University, Assistant Professor; ACTION Lab, Director
Computer vision · Machine learning · Data mining
Haibao Yu
Institute for AI Industry Research (AIR), Tsinghua University, Beijing 100083, China; The University of Hong Kong, Hong Kong 999077, China
Weidong Zhang
Samsung Research America
Computer Vision · Image Processing