🤖 AI Summary
This work addresses the limited availability of 3D data in LiDAR scene generation, which constrains generative quality. The authors propose R3DPA, which they describe as the first LiDAR scene generation method to transfer knowledge from large-scale image-pretrained generative models to LiDAR point cloud synthesis. The method also aligns the intermediate features of the generative model with self-supervised 3D representations, which improves generation quality in a diffusion or flow-matching framework. On the KITTI-360 benchmark, the approach achieves state-of-the-art performance, and the unconditional model supports inference-time editing such as object inpainting and scene mixing.
📝 Abstract
LiDAR scene synthesis is an emerging solution to the scarcity of 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow-matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds and to leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating the limited size of LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing using only an unconditional model. On the KITTI-360 benchmark, R3DPA achieves state-of-the-art performance. Code and pretrained models are available at https://github.com/valeoai/R3DPA.
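
To make item (i) concrete, the sketch below shows one common way a feature alignment objective can be written: intermediate features of the generator are passed through a linear projection and compared, by cosine similarity, with features from a frozen self-supervised encoder. The abstract does not state the exact loss, so the cosine form, the linear projection, the feature shapes, and every name here (`alignment_loss`, `project`, `weight`) are assumptions for illustration, not the R3DPA implementation.

```python
import math
import random


def project(vec, weight):
    """Linear projection; `weight` is a list of rows with shape (D_target, D_hidden)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def cosine(u, v):
    """Cosine similarity between two vectors, with a small epsilon for stability."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def alignment_loss(hidden, target, weight):
    """Negative mean cosine similarity between projected generator features
    and target features (assumed form, not taken from the paper).

    hidden: list of N vectors, hypothetical intermediate generator features
    target: list of N vectors, hypothetical frozen self-supervised 3D features
    """
    sims = [cosine(project(h, weight), t) for h, t in zip(hidden, target)]
    return -sum(sims) / len(sims)


# Toy data: 16 tokens, hidden width 32, target width 8 (all sizes are made up).
rng = random.Random(0)
hidden = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(16)]
weight = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(8)]

# Targets that exactly match the projected features, versus unrelated targets.
aligned_target = [project(h, weight) for h in hidden]
random_target = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]

loss_aligned = alignment_loss(hidden, aligned_target, weight)
loss_random = alignment_loss(hidden, random_target, weight)
```

In this toy setup the loss is close to its minimum of -1 when the targets match the projected features and is higher for unrelated targets. In training, a term of this kind would typically be added to the diffusion or flow-matching objective with a weighting coefficient, which is likewise an assumption here.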