Safe2Drive: Evaluating Safe Driving Behaviors of E2E Autonomous Driving Models

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current end-to-end autonomous driving models achieve high driving scores in closed-loop simulation, but their behavior in common safety-critical scenarios has not been evaluated systematically. To address this gap, this work introduces Safe2Drive, a benchmark of 100 scenarios across three frequent hazard families: work zones, pedestrian jaywalking, and occluded vulnerable road users. It also proposes the SafeDriving Score (SDS), a safety-centric metric that augments prior evaluators with checks for pre-crash braking, contact with work-zone objects, lane centering, and driving smoothness. Safe2Drive is built on the CARLA platform as a set of Bench2Drive-aligned closed-loop scenario extensions. On Safe2Drive, the state-of-the-art models LEAD and SimLingo show sharp drops in driving score relative to their Bench2Drive baselines and reach SDS values of only 11.85 and 15.27 respectively, which is consistent with brittle safe-driving behavior.

📝 Abstract

Recent end-to-end (E2E) autonomous driving policies achieve high driving scores in closed-loop simulations. Yet it remains unclear whether these policies handle common safety-critical scenarios. We present Safe2Drive (S2D), a set of Bench2Drive-aligned scenario extensions focused on three frequent families of road hazards: work zones, pedestrian jaywalking, and occluded vulnerable road users (VRUs). Safe2Drive adds 100 common but challenging scenarios and introduces SafeDriving Score (SDS), a safety-centric metric that augments prior evaluators with pre-crash braking, work-zone object contact, lane centering, and smoothness checks. Evaluating two state-of-the-art policies (LEAD and SimLingo) on S2D, we find that their driving scores drop sharply relative to their reported Bench2Drive baselines (LEAD: from 94.70 DS on Bench2Drive to 39.95 DS on S2D; SimLingo: from 85.07 DS on Bench2Drive to 41.00 DS on S2D) and that SDS on S2D is low (11.85 for LEAD and 15.27 for SimLingo). These results are consistent with brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. This study highlights a lack of safe behavioral reasoning in E2E models even when tested on CARLA towns that are part of the training set. We plan to release the code and videos for all 100 S2D scenarios.
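
To put the reported driving-score (DS) drops in perspective, the short sketch below computes the absolute and relative decrease for each policy from the four DS figures quoted in the abstract. The helper name `ds_drop` and the dictionary layout are illustrative choices, not part of the Safe2Drive code; the sketch does not attempt to reproduce the SDS metric, whose formula is not given here.

```python
# Driving scores (DS) as quoted in the abstract: (Bench2Drive, Safe2Drive).
reported_ds = {
    "LEAD": (94.70, 39.95),
    "SimLingo": (85.07, 41.00),
}


def ds_drop(baseline, s2d):
    """Hypothetical helper: return (absolute drop in DS points, relative drop in %)."""
    absolute = baseline - s2d
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


drops = {name: ds_drop(*scores) for name, scores in reported_ds.items()}

for name, (absolute, relative) in drops.items():
    print(f"{name}: -{absolute:.2f} DS points ({relative:.1f}% relative drop)")
```

Under these numbers, LEAD loses roughly 58% of its Bench2Drive driving score and SimLingo roughly 52%, so both policies lose more than half of their baseline DS on S2D.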

Problem

Research questions and friction points this paper is trying to address.

autonomous driving

safety-critical scenarios

end-to-end models

vulnerable road users

driving safety evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

end-to-end autonomous driving

safety evaluation

SafeDriving Score