🤖 AI Summary
This study addresses the challenge of reconstructing high-fidelity 4D driving scenes from unlabeled, monocular real-world driving videos to enable closed-loop simulation, particularly in long-tail scenarios such as work zones. The authors propose Dash2Sim, a framework that integrates monocular depth estimation, visual localization, and map alignment to generate metrically scaled, georeferenced dense 4D scenes without manual annotations, and that validates geometric consistency against independent map data. Key contributions include: the release of ROADWork4D, a large-scale dataset spanning 17 cities with 4,244 scenes and 2.7 million 3D objects, along with its verified subset ROADWork4D-CL (2,201 scenes); empirical evidence that existing planners degrade significantly in work zones on this benchmark; and a demonstration that the recovered depth improves novel-view synthesis quality by up to 19% on perceptual metrics.
📝 Abstract
Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. However, they are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations captured by dashcams. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies each log against an independently maintained map without requiring annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset, ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.
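The abstract does not say how the map-based verification is carried out. As a purely illustrative sketch, and not the authors' method, one simple way to check a recovered ego trajectory against a map is to measure how far it strays from a road centerline. Everything here is an assumption made for illustration: the function names, the 2D metric coordinates, and the 2.0 m tolerance.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from 2D point p to the segment a-b (metres)."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_polyline(p, polyline):
    """Distance from p to the nearest segment of a polyline (>= 2 points)."""
    return min(
        point_segment_distance(p, a, b)
        for a, b in zip(polyline, polyline[1:])
    )


def max_deviation(trajectory, centerline):
    """Largest distance from any trajectory point to the centerline."""
    return max(distance_to_polyline(p, centerline) for p in trajectory)


def is_map_consistent(trajectory, centerline, tol_m=2.0):
    """Hypothetical acceptance test: tol_m is an assumed threshold."""
    return max_deviation(trajectory, centerline) <= tol_m


if __name__ == "__main__":
    centerline = [(0.0, 0.0), (10.0, 0.0)]            # straight road, metres
    trajectory = [(1.0, 0.5), (5.0, -1.0), (9.0, 0.2)]
    print(max_deviation(trajectory, centerline))
    print(is_map_consistent(trajectory, centerline))
```

A real check would also need to handle the choice of map feature, multi-lane roads, and heading agreement. This sketch covers only positional deviation in a shared metric frame.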