AI Summary
To address geometric inconsistency, appearance distortion, and insufficient controllability in large-scale 3D driving scene generation for autonomous driving simulation, this paper proposes a multimodal-aligned framework jointly generating semantic occupancy and multi-view images. Our method integrates diffusion modeling, 3D Gaussian splatting reconstruction, and alignment optimization to jointly preserve geometric accuracy and visual fidelity. Key contributions include: (1) the first cross-modal co-generation paradigm unifying semantic occupancy representation with multi-view image synthesis; (2) a consistency-aware scene outpainting technique enabling local edits to propagate continuously into global 3D space; and (3) an LLM-driven high-level semantic control interface supporting text-, layout-, and intent-based multi-granularity conditioning. Experiments demonstrate significant improvements in large-scale scene generation quality, enabling closed-loop simulation and interactive editing, thereby providing high-fidelity, editable synthetic data and virtual environments for autonomous driving development.
Abstract
Diffusion models are advancing autonomous driving by enabling realistic data synthesis, predictive end-to-end planning, and closed-loop simulation, with a primary focus on temporally consistent generation. However, the generation of large-scale 3D scenes that require spatial coherence remains underexplored. In this paper, we propose X-Scene, a novel framework for large-scale driving scene generation that achieves both geometric intricacy and appearance fidelity, while offering flexible controllability. Specifically, X-Scene supports multi-granular control, including low-level conditions such as user-provided or text-driven layouts for detailed scene composition, and high-level semantic guidance such as user intent and LLM-enriched text prompts for efficient customization. To enhance geometric and visual fidelity, we introduce a unified pipeline that sequentially generates 3D semantic occupancy and the corresponding multi-view images, while ensuring alignment between the two modalities. Additionally, we extend the generated local region into a large-scale scene through consistency-aware scene outpainting, which extrapolates new occupancy and images conditioned on the previously generated area, enhancing spatial continuity and preserving visual coherence. The resulting scenes are lifted into high-quality 3DGS representations, supporting diverse applications such as scene exploration. Comprehensive experiments demonstrate that X-Scene significantly advances controllability and fidelity for large-scale driving scene generation, empowering data generation and simulation for autonomous driving.
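The sequential flow described above (generate occupancy, then aligned multi-view images, then iteratively outpaint into a large scene before lifting to 3DGS) can be sketched as a minimal control loop. This is a hypothetical illustration, not the authors' actual code: every function and name below is a placeholder stub standing in for a learned model stage.

```python
# Hypothetical sketch of the pipeline structure described in the abstract.
# All functions are illustrative stubs, not X-Scene's real API.

def generate_occupancy(layout, prompt):
    """Stub for the 3D semantic-occupancy diffusion stage."""
    return {"layout": layout, "prompt": prompt, "voxels": "occ"}

def generate_images(occupancy):
    """Stub for occupancy-conditioned multi-view image synthesis
    (e.g. a surround-view camera rig; six views assumed here)."""
    return [f"view_{i}" for i in range(6)]

def outpaint(region, direction):
    """Stub for consistency-aware outpainting: a new region's occupancy
    and images are extrapolated conditioned on the previous region."""
    new_occ = generate_occupancy(region["occ"]["layout"], direction)
    return {"occ": new_occ, "images": generate_images(new_occ)}

def build_scene(layout, prompt, num_extensions=3):
    # 1) Generate the initial local region: occupancy first, then the
    #    corresponding multi-view images aligned to it.
    occ = generate_occupancy(layout, prompt)
    regions = [{"occ": occ, "images": generate_images(occ)}]
    # 2) Iteratively extend into a large-scale scene via outpainting,
    #    each step conditioned on the last region for spatial continuity.
    for step in range(num_extensions):
        regions.append(outpaint(regions[-1], f"direction_{step}"))
    # 3) The assembled regions would then be lifted into a 3DGS
    #    representation (omitted in this stub).
    return regions

scene = build_scene(layout="user_layout", prompt="rainy urban street")
print(len(scene))  # initial region + 3 outpainted extensions -> 4
```

The point of the sketch is the ordering constraint: images are always conditioned on occupancy, and each outpainting step is conditioned on the previously generated region rather than generated independently.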