🤖 AI Summary
Existing methods for 3D semantic scene generation rely on complex, task-specific 3D architectures, which limits both their simplicity and their editing flexibility. This work proposes EditSSC, which reshapes 3D semantic occupancy grids into multi-channel bird's-eye-view (BEV) images so that the quantized autoencoder and UNet of an off-the-shelf 2D latent diffusion model (Stable Diffusion) can be reused with minimal modifications. Diffusion is performed on the latents after quantization, and the codebook's class-to-code correspondences enable sketch-guided generation, inpainting, and outpainting without any retraining. On the SemanticKITTI dataset, the approach outperforms existing 3D-specific baselines on unconditional generation.
📝 Abstract
3D semantic scene generation is crucial for autonomous driving applications, yet most methods rely on complex 3D-specific architectures such as triplane encoders and adapted diffusion networks, limiting both their simplicity and their editing capabilities. We propose EditSSC, an editing-ready method for 3D semantic scene generation using 2D Bird's Eye View (BEV) representations and an off-the-shelf latent diffusion network. Our approach reshapes 3D semantic occupancy grids into multi-channel BEV images and leverages the quantized autoencoder and UNet from Stable Diffusion with minimal modifications. We perform diffusion on the latents after quantization, which enables training-free editing capabilities. By exploiting class-to-code correspondences in the codebook, our method supports sketch-guided generation, inpainting, and outpainting without any retraining. On SemanticKITTI, EditSSC outperforms existing 3D-specific baselines on unconditional generation, demonstrating that well-established 2D architectures can be effectively repurposed for 3D scene generation and editing.
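To make the reshaping step concrete, the following is a minimal sketch of how a 3D semantic occupancy grid might be turned into a multi-channel BEV image, with the height axis becoming the channel axis. The abstract does not give the exact layout, so the axis order, the function names `occupancy_to_bev` and `bev_to_occupancy`, the 256 x 256 x 32 grid size, and the 20-class label range are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def occupancy_to_bev(grid: np.ndarray) -> np.ndarray:
    """Reshape an (X, Y, Z) grid of class labels into a (Z, X, Y) BEV image.

    Assumption: each height slice becomes one image channel, so a 2D network
    sees the scene from above with Z channels per pixel.
    """
    if grid.ndim != 3:
        raise ValueError(f"expected a 3D grid, got shape {grid.shape}")
    # Move the height axis (last) to the front: channels-first image layout.
    return np.ascontiguousarray(np.moveaxis(grid, -1, 0))


def bev_to_occupancy(bev: np.ndarray) -> np.ndarray:
    """Inverse of occupancy_to_bev: (Z, X, Y) back to (X, Y, Z)."""
    if bev.ndim != 3:
        raise ValueError(f"expected a 3D array, got shape {bev.shape}")
    return np.ascontiguousarray(np.moveaxis(bev, 0, -1))


# Hypothetical grid: 256 x 256 x 32 voxels, integer labels in [0, 20).
rng = np.random.default_rng(0)
grid = rng.integers(0, 20, size=(256, 256, 32), dtype=np.uint8)

bev = occupancy_to_bev(grid)          # 32-channel image of size 256 x 256
restored = bev_to_occupancy(bev)      # lossless round trip
```

Because the mapping is a pure axis permutation, it is lossless: a scene generated or edited in BEV form can be converted back to a voxel grid without any learned decoder for that step.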