🤖 AI Summary
This work addresses a safety limitation of existing end-to-end autonomous driving systems in sudden pedestrian crossing scenarios, where frame-based sensors suffer from perception latency and motion blur. The authors propose DeepIPCv3, a multimodal navigation framework that fuses LiDAR point clouds with event streams from a dynamic vision sensor (DVS). The core contribution is a Transformer-inspired cross-modal attention mechanism that dynamically correlates the two modalities, coupled with a hybrid policy network that blends heuristic trajectory tracking with direct neural predictions to produce local waypoints and control commands. The system exploits the microsecond-level temporal resolution of the DVS and is reported to remain robust under both noon and evening illumination. Evaluated offline on a newly collected multimodal dataset, the approach achieves state-of-the-art predictive performance, with the lowest trajectory and control command errors among the compared configurations, while effectively eliminating the exposure failures and motion blur that affect frame-based cameras.
📝 Abstract
Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and motion blur during highly dynamic encounters, particularly sudden pedestrian crossings. To address this critical safety vulnerability, we propose DeepIPCv3, a novel multi-modal autonomous navigation framework that combines the dense 3D spatial geometry of LiDAR point clouds with the microsecond-level asynchronous event streams of a Dynamic Vision Sensor (DVS). We introduce a Transformer-inspired cross-modal attention mechanism to dynamically correlate these distinct modalities, allowing the network to prioritize high-speed dynamic updates without sacrificing structural scene awareness. The fused latent representations are then mapped to safe local waypoints and executable control commands via a hybrid policy network that blends heuristic trajectory tracking with direct neural predictions. Because live testing of sudden crossing scenarios carries severe physical risk, the framework is evaluated offline on a custom multi-modal dataset collected under both well-illuminated noon and challenging evening conditions. Extensive comparative and ablation studies demonstrate that DeepIPCv3 achieves state-of-the-art predictive performance. By effectively eliminating exposure failures and motion blur, the proposed LiDAR and DVS fusion yields the lowest trajectory and control command errors, enabling highly reactive, mathematically bounded evasive maneuvers regardless of ambient illumination. To support future research, we will release the code in our GitHub repository at https://github.com/oskarnatan/DeepIPCv3.
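To make the two architectural ideas in the abstract concrete, the following is a minimal NumPy sketch, not the DeepIPCv3 implementation. It assumes that LiDAR features act as queries and DVS event features as keys and values, that a single attention head with a residual connection is used, and that the hybrid policy is a fixed convex blend of two command vectors. The abstract does not state any of these details, so the function names (`cross_modal_attention`, `blend_controls`), the weight matrices, and the `alpha` parameter are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_modal_attention(lidar_tokens, event_tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative only).

    Assumption: LiDAR tokens are the queries, DVS event tokens supply the
    keys and values. All projection matrices are (d_model, d_model).
    """
    q = lidar_tokens @ w_q                      # (n_lidar, d_model)
    k = event_tokens @ w_k                      # (n_event, d_model)
    v = event_tokens @ w_v                      # (n_event, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (n_lidar, n_event)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    fused = lidar_tokens + weights @ v          # residual keeps scene structure
    return fused, weights


def blend_controls(heuristic_cmd, neural_cmd, alpha=0.5):
    """Convex blend of a heuristic tracker command and a neural prediction.

    Assumption: a fixed weight alpha in [0, 1]; the actual blending rule
    used by the authors is not given in the abstract.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * np.asarray(heuristic_cmd) + (1.0 - alpha) * np.asarray(neural_cmd)
```

In this sketch the residual term preserves the LiDAR-derived structure while the attention term injects event-driven updates, which mirrors the stated goal of prioritizing dynamic changes without losing scene awareness. A learned or state-dependent blending weight would be an equally plausible reading of "hybrid policy network".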