🤖 AI Summary
Synthetic autonomous driving (AD) videos often suffer from subtle geometric distortions that substantially degrade performance on 3D perception tasks relative to real-world data. To address this, we propose a reinforcement learning framework, free of pixel-level supervision, that refines diffusion-based video generation using feedback from downstream AD perception models. Our method introduces a hierarchical geometric reward system (point-, line-, and surface-level consistency plus scene occupancy alignment) together with a latent-space sliding-window optimization strategy. We further design differentiable GeoScores to quantitatively measure geometric fidelity. Evaluated on nuScenes, our approach reduces vanishing-point error by 21% and depth-estimation error by 57%, and improves 3D detection mAP by 12.7%, significantly narrowing the performance gap between synthetic and real data. The core contribution is the first realization of perception-guided, multi-level geometric consistency optimization, establishing a novel paradigm for high-fidelity synthetic AD data generation.
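To make the hierarchical reward concrete, below is a minimal PyTorch sketch of how point-, line-, surface-, and occupancy-level terms could be combined into one differentiable reward. Every function name, input convention, and the uniform weighting here is an illustrative assumption, not the paper's exact formulation; the per-level errors stand in for whatever the latent-space perception heads actually predict.

```python
import torch
import torch.nn.functional as F

def point_reward(pred_depth, ref_depth):
    # Point-level consistency: mean absolute depth disagreement,
    # mapped to (0, 1] so that larger is better.
    return torch.exp(-torch.abs(pred_depth - ref_depth).mean())

def line_reward(pred_vp, ref_vp):
    # Line-level consistency: distance between vanishing points
    # in normalized image coordinates.
    return torch.exp(-torch.norm(pred_vp - ref_vp))

def plane_reward(pred_normals, ref_normals):
    # Surface-level consistency: cosine alignment of surface normals.
    cos = F.cosine_similarity(pred_normals, ref_normals, dim=-1)
    return cos.mean().clamp(min=0.0)

def occupancy_reward(pred_occ, ref_occ):
    # Scene-level consistency: soft IoU of occupancy grids in [0, 1].
    inter = (pred_occ * ref_occ).sum()
    union = pred_occ.sum() + ref_occ.sum() - inter
    return inter / (union + 1e-6)

def hierarchical_geometric_reward(out, ref, w=(0.25, 0.25, 0.25, 0.25)):
    # Weighted sum of the four consistency levels; every term is
    # differentiable, so the reward can back-propagate into the generator.
    return (w[0] * point_reward(out["depth"], ref["depth"])
            + w[1] * line_reward(out["vp"], ref["vp"])
            + w[2] * plane_reward(out["normals"], ref["normals"])
            + w[3] * occupancy_reward(out["occ"], ref["occ"]))
```

The exp(-error) mapping keeps each term bounded and smooth, which is convenient when the aggregate is used as a gradient-based reward signal.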
📝 Abstract
Synthetic data is crucial for advancing autonomous driving (AD) systems, yet current state-of-the-art video generation models, despite their visual realism, suffer from subtle geometric distortions that limit their utility for downstream perception tasks. We identify and quantify this critical issue, demonstrating a significant performance gap in 3D object detection when using synthetic versus real data. To address it, we introduce Reinforcement Learning with Geometric Feedback (RLGF), which refines video diffusion models by incorporating rewards from specialized latent-space AD perception models. Its core components are an efficient Latent-Space Windowing Optimization technique for targeted feedback during diffusion and a Hierarchical Geometric Reward (HGR) system that provides multi-level rewards for point, line, and plane alignment as well as scene occupancy coherence. To quantify these distortions, we propose GeoScores. Applied to models such as DiVE on nuScenes, RLGF substantially reduces geometric errors (e.g., VP error by 21%, depth error by 57%) and dramatically improves 3D object detection mAP by 12.7%, narrowing the gap to real-data performance. RLGF offers a plug-and-play solution for generating geometrically sound and reliable synthetic videos for AD development.
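The Latent-Space Windowing Optimization idea can be read as truncated backpropagation through a slice of the denoising trajectory: gradients flow only through the chosen window of steps, and the geometric reward is evaluated directly on the windowed latent rather than on a fully decoded video. The sketch below assumes a HuggingFace-diffusers-style scheduler interface; the window indices, optimizer, learning rate, and the latent-space `reward_fn` (e.g., an HGR aggregate like the one above) are all placeholder assumptions rather than the paper's settings.

```python
import torch

def windowed_geometric_update(denoiser, scheduler, reward_fn, x_t, cond,
                              window=(10, 15), lr=1e-5):
    """One gradient update with feedback restricted to a denoising window.

    Assumes `denoiser` is an nn.Module predicting noise, `scheduler` is a
    diffusers-style scheduler with `timesteps` already set, and `reward_fn`
    scores latents directly (no pixel-space decoding needed).
    """
    opt = torch.optim.AdamW(denoiser.parameters(), lr=lr)
    lo, hi = window
    x = x_t
    for i, t in enumerate(scheduler.timesteps):
        if i == lo:
            x = x.detach()  # cut any prior graph: backprop starts at the window
        with torch.set_grad_enabled(lo <= i < hi):
            eps = denoiser(x, t, cond)                 # predicted noise
            x = scheduler.step(eps, t, x).prev_sample  # one denoising step
        if i == hi - 1:
            break  # stop at the window's end; reward is taken in latent space
    loss = -reward_fn(x)  # maximize geometric reward on the windowed latent
    opt.zero_grad()
    loss.backward()       # gradients touch only the steps inside the window
    opt.step()
```

Restricting gradients to a short window bounds activation memory, which is what would make reward feedback tractable for multi-frame video latents.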