🤖 AI Summary
To address insufficient static and dynamic scene understanding and limited safety guarantees in multimodal end-to-end autonomous driving systems, this paper proposes a temporal-guided multimodal fusion framework. The method explicitly incorporates ego-vehicle state sequences (rotation angles, steering, throttle signals, and waypoint vectors) as guiding inputs and introduces a temporal guidance loss function that jointly optimizes geometric perception features and control signals along the time dimension. By integrating geometric feature extraction, multimodal end-to-end learning, and waypoint prediction, the framework achieves a driving score of 70%, a route completion score of 94%, and an infraction score of 0.78 in the CARLA simulator, outperforming existing end-to-end baselines. Key contributions include: (1) the explicit use of ego-state time series signals as guidance for multimodal fusion; (2) a temporally aligned optimization objective that bridges perception and control; and (3) strong benchmark performance demonstrating improved scene understanding and driving safety.
📝 Abstract
Multi-modal end-to-end autonomous driving has shown promising advances in recent work. By embedding more modalities into end-to-end networks, the system's understanding of both the static and dynamic aspects of the driving environment is enhanced, thereby improving the safety of autonomous driving. In this paper, we introduce METDrive, an end-to-end system that leverages temporal guidance from the embedded time series features of ego states, including rotation angles, steering, throttle signals, and waypoint vectors. The geometric features derived from perception sensor data and the time series features of ego state data jointly guide waypoint prediction through the proposed temporal guidance loss function. We evaluated METDrive on the CARLA leaderboard benchmarks, achieving a driving score of 70%, a route completion score of 94%, and an infraction score of 0.78.
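To make the idea of a temporally weighted waypoint objective concrete, the sketch below shows one minimal way such a loss could look. This is an illustrative assumption, not the paper's actual formulation: the geometric decay weighting (`decay`), the per-step L2 error, and the function name `temporal_guidance_loss` are all hypothetical choices made for this example.

```python
import numpy as np

def temporal_guidance_loss(pred_wp, gt_wp, decay=0.9):
    """Hypothetical temporally weighted waypoint loss (illustrative only).

    pred_wp, gt_wp: arrays of shape (T, 2) holding predicted and
    ground-truth (x, y) waypoints over T future time steps.
    decay: geometric weight so near-term waypoints, which matter most for
    immediate control, contribute more (an assumed design choice, not
    METDrive's exact loss).
    """
    pred_wp = np.asarray(pred_wp, dtype=float)
    gt_wp = np.asarray(gt_wp, dtype=float)
    T = pred_wp.shape[0]
    weights = decay ** np.arange(T)       # w_t = decay^t, larger for early steps
    weights = weights / weights.sum()     # normalize weights to sum to 1
    per_step = np.linalg.norm(pred_wp - gt_wp, axis=1)  # L2 error at each step
    return float(np.sum(weights * per_step))
```

Under this weighting, an error on the first predicted waypoint is penalized more heavily than the same error on the last one, which is one plausible way to align the optimization objective with the time dimension of the ego-state sequence.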