AI Summary
This paper addresses the challenge of fusing heterogeneous event-camera and LiDAR data to generate dense depth maps. We propose the first Transformer-based deep fusion framework for this task. Methodologically, we design a dual-stream encoder comprising an event voxel encoder and a LiDAR sparse point cloud projector, augmented with intra-modal self-attention to model spatiotemporal dependencies within each modality and cross-modal cross-attention to achieve fine-grained spatiotemporal alignment and complementary information integration. Our key contribution is the first systematic integration of Transformer architectures into event-LiDAR depth estimation, overcoming limitations of conventional CNN-based or hand-crafted alignment approaches. Evaluated on standard event-based depth estimation benchmarks, our method establishes new state-of-the-art performance: it reduces absolute depth error in the near range (<5 m) by up to 4× compared to prior best methods, significantly improving both accuracy and robustness of dense depth reconstruction.
Abstract
Event cameras and LiDARs provide complementary yet distinct data: respectively, asynchronous detections of changes in lighting versus sparse but accurate depth measurements at a fixed rate. To date, few works have explored the combination of these two modalities. In this article, we propose a novel neural-network-based method for fusing event and LiDAR data in order to estimate dense depth maps. Our architecture, DELTA, exploits the concepts of self- and cross-attention to model the spatial and temporal relations within and between the event and LiDAR data. Following a thorough evaluation, we demonstrate that DELTA sets a new state of the art for the event-based depth estimation problem, and that it reduces errors by up to four times at close range compared with the previous state of the art.
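The self-/cross-attention scheme described above can be sketched with plain scaled dot-product attention: each modality first attends to itself (intra-modal), then event queries attend to LiDAR keys and values (cross-modal). This is a minimal NumPy illustration of the mechanism only; the token counts, feature width, and variable names are assumptions for the example, not DELTA's actual architecture or implementation.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
# Illustrative token sets (shapes are hypothetical): event-voxel features
# and projected sparse LiDAR features, embedded in the same 32-dim space.
event_tokens = rng.standard_normal((16, 32))
lidar_tokens = rng.standard_normal((8, 32))

# Intra-modal self-attention: each modality models relations within itself.
event_self = attention(event_tokens, event_tokens, event_tokens)
lidar_self = attention(lidar_tokens, lidar_tokens, lidar_tokens)

# Cross-modal cross-attention: event queries gather complementary
# depth information from the LiDAR stream.
event_fused = attention(event_self, lidar_self, lidar_self)
print(event_fused.shape)  # (16, 32): one fused feature per event token
```

In a full model these operations would be multi-headed, wrapped with learned projections and residual connections, and stacked; the sketch only shows how self-attention captures within-modality structure while cross-attention aligns and mixes the two streams.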