MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion

πŸ“… 2025-08-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Texture distortion in urban mesh models hinders their application in virtual navigation and autonomous driving simulation. To address this, we propose a three-stage diffusion framework that leverages urban meshes as geometric priors to jointly optimize cross-view consistency and stylistic coherence. First, a cascaded outpainting ControlNet guides large-scale scene expansion. Second, AGInpaint enables high-fidelity intermediate-view interpolation. Third, the GCAlign module performs global exposure and geometric alignment; notably, we introduce the first joint optimization of ControlNet and 3D Gaussian Splatting to ensure both geometric fidelity and visual continuity. Experiments demonstrate that our method significantly outperforms existing approaches in texture quality, cross-view consistency, and multi-style controllability, enabling scalable, high-fidelity urban scene synthesis.
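
As a way to visualize the staging described above, here is a minimal Python sketch of the pipeline's control flow. This is an illustration only: `cascaded_outpaint`, `aginpaint`, `gcalign`, and `mess_pipeline` are hypothetical stub names standing in for the paper's components, not the authors' released code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Scene:
    views: List[str] = field(default_factory=list)  # generated view identifiers

# Placeholder stubs for the three stages (hypothetical, not the real API).
def cascaded_outpaint(mesh_renders: List[str]) -> List[str]:
    """Stage 1: generate geometrically consistent sparse anchor views."""
    return [f"anchor({r})" for r in mesh_renders]

def aginpaint(anchors: List[str]) -> List[str]:
    """Stage 2: densify by inpainting intermediate views between anchors."""
    dense = []
    for a, b in zip(anchors, anchors[1:]):
        dense += [a, f"interp({a},{b})"]
    return dense + anchors[-1:]

def gcalign(views: List[str]) -> List[str]:
    """Stage 3: global exposure and geometric alignment across all views."""
    return [f"aligned({v})" for v in views]

def mess_pipeline(mesh_renders: List[str]) -> Scene:
    sparse = cascaded_outpaint(mesh_renders)
    dense = aginpaint(sparse)
    return Scene(views=gcalign(dense))

print(mess_pipeline(["r0", "r1", "r2"]).views)
```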

πŸ“ Abstract
Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models serving as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality. Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques.
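
The abstract notes that the 3DGS scene is reconstructed "by initializing Gaussian balls on the mesh surface." A common way to realize this, sketched below under the assumption of a plain triangle-mesh representation, is area-weighted uniform sampling of the surface to seed Gaussian centers; `init_gaussians_on_mesh` is an illustrative name, not the paper's code.

```python
import numpy as np

def init_gaussians_on_mesh(vertices: np.ndarray, faces: np.ndarray, n: int, rng=None):
    """Sample n points uniformly (by area) on a triangle mesh to seed 3DGS centers.

    vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices.
    Returns an (n, 3) array of candidate Gaussian means.
    """
    rng = np.random.default_rng() if rng is None else rng
    tris = vertices[faces]                                   # (F, 3, 3)
    # Triangle areas via the cross product; pick faces proportionally to area.
    cross = np.cross(tris[:, 1] - tris[:, 0], tris[:, 2] - tris[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    idx = rng.choice(len(faces), size=n, p=areas / areas.sum())
    # Uniform barycentric coordinates via the square-root trick.
    u, v = rng.random(n), rng.random(n)
    su = np.sqrt(u)
    b0, b1, b2 = 1.0 - su, su * (1.0 - v), su * v
    t = tris[idx]
    return b0[:, None] * t[:, 0] + b1[:, None] * t[:, 1] + b2[:, None] * t[:, 2]

# Toy usage: two triangles forming a unit square in the z=0 plane.
V = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], float)
F = np.array([[0, 1, 2], [0, 2, 3]])
print(init_gaussians_on_mesh(V, F, 5).shape)  # (5, 3)
```

In a full system, each sampled point would also receive an initial scale, orientation, and color before joint optimization with the diffusion-generated views.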
Problem

Research questions and friction points this paper is trying to address.

Generate realistic textures for city mesh models that currently lack them
Achieve cross-view consistency in 3D scene generation using diffusion models
Maintain geometric alignment while ensuring visual quality in outdoor scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mesh-guided diffusion for scene synthesis
Cascaded ControlNets for geometric consistency
3D Gaussian Splatting for scene reconstruction
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Xuyang Chen
TU Munich
Zhijun Zhai
Wuhan University
Kaixuan Zhou
Huawei Riemann Lab
Zengmao Wang
Associate Professor, School of Computer Science, Wuhan University
Artificial Intelligence · Machine Learning · Remote Sensing
Jianan He
Huawei Riemann Lab
Dong Wang
Huawei Riemann Lab
Yanfeng Zhang
Northeastern University, China
Database Systems · Machine Learning Systems
Mingwei Sun
Wuhan University, Huawei Riemann Lab
Rüdiger Westermann
TU Munich
Konrad Schindler
Professor of Photogrammetry and Remote Sensing, ETH Zurich
Photogrammetry · Remote Sensing · Image Analysis · Computer Vision
Liqiu Meng
TU Munich