MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion

πŸ“… 2025-08-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Texture distortion in urban mesh models hinders their application in virtual navigation and autonomous driving simulation. To address this, we propose a three-stage diffusion framework that leverages urban meshes as geometric priors to jointly optimize cross-view consistency and stylistic coherence. First, a cascaded outpainting ControlNet guides large-scale scene expansion. Second, AGInpaint enables high-fidelity intermediate-view interpolation. Third, the GCAlign module performs global exposure and geometric alignment; notably, we introduce the first joint optimization of ControlNet and 3D Gaussian Splatting to ensure both geometric fidelity and visual continuity. Experiments demonstrate that our method significantly outperforms existing approaches in texture quality, cross-view consistency, and multi-style controllability, enabling scalable, high-fidelity urban scene synthesis.
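
As a way to visualize the staging described above, here is a minimal Python sketch of the pipeline's control flow. This is an illustration only: `cascaded_outpaint`, `aginpaint`, `gcalign`, and `mess_pipeline` are hypothetical stub names standing in for the paper's components, not the authors' released code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Scene:
    views: List[str] = field(default_factory=list)  # generated view identifiers

# Placeholder stubs for the three stages (hypothetical, not the real API).
def cascaded_outpaint(mesh_renders: List[str]) -> List[str]:
    """Stage 1: generate geometrically consistent sparse anchor views."""
    return [f"anchor({r})" for r in mesh_renders]

def aginpaint(anchors: List[str]) -> List[str]:
    """Stage 2: densify by inpainting intermediate views between anchors."""
    dense = []
    for a, b in zip(anchors, anchors[1:]):
        dense += [a, f"interp({a},{b})"]
    return dense + anchors[-1:]

def gcalign(views: List[str]) -> List[str]:
    """Stage 3: global exposure and geometric alignment across all views."""
    return [f"aligned({v})" for v in views]

def mess_pipeline(mesh_renders: List[str]) -> Scene:
    sparse = cascaded_outpaint(mesh_renders)
    dense = aginpaint(sparse)
    return Scene(views=gcalign(dense))

print(mess_pipeline(["r0", "r1", "r2"]).views)
```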

πŸ“ Abstract
Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models serving as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality. Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques.
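
The abstract notes that the 3DGS scene is reconstructed "by initializing Gaussian balls on the mesh surface." A common way to realize this, sketched below under the assumption of a plain triangle-mesh representation, is area-weighted uniform sampling of the surface to seed Gaussian centers; `init_gaussians_on_mesh` is an illustrative name, not the paper's code.

```python
import numpy as np

def init_gaussians_on_mesh(vertices: np.ndarray, faces: np.ndarray, n: int, rng=None):
    """Sample n points uniformly (by area) on a triangle mesh to seed 3DGS centers.

    vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices.
    Returns an (n, 3) array of candidate Gaussian means.
    """
    rng = np.random.default_rng() if rng is None else rng
    tris = vertices[faces]                                   # (F, 3, 3)
    # Triangle areas via the cross product; pick faces proportionally to area.
    cross = np.cross(tris[:, 1] - tris[:, 0], tris[:, 2] - tris[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    idx = rng.choice(len(faces), size=n, p=areas / areas.sum())
    # Uniform barycentric coordinates via the square-root trick.
    u, v = rng.random(n), rng.random(n)
    su = np.sqrt(u)
    b0, b1, b2 = 1.0 - su, su * (1.0 - v), su * v
    t = tris[idx]
    return b0[:, None] * t[:, 0] + b1[:, None] * t[:, 1] + b2[:, None] * t[:, 2]

# Toy usage: two triangles forming a unit square in the z=0 plane.
V = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], float)
F = np.array([[0, 1, 2], [0, 2, 3]])
print(init_gaussians_on_mesh(V, F, 5).shape)  # (5, 3)
```

In a full system, each sampled point would also receive an initial scale, orientation, and color before joint optimization with the diffusion-generated views.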
Problem

Research questions and friction points this paper is trying to address.

Generate realistic textures for city mesh models that currently lack them
Achieve cross-view consistency in 3D scene generation using diffusion models
Maintain geometric alignment while ensuring visual quality in outdoor scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mesh-guided diffusion for scene synthesis
Cascaded ControlNets for geometric consistency
3D Gaussian Splatting for scene reconstruction
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Xuyang Chen
TU Munich
Zhijun Zhai
Wuhan University
Kaixuan Zhou
Huawei Riemann Lab
Zengmao Wang
Associate Professor, School of Computer Science, Wuhan University
Artificial Intelligence · Machine Learning · Remote Sensing
Jianan He
Huawei Riemann Lab
Dong Wang
Huawei Riemann Lab
Yanfeng Zhang
Northeastern University, China
Database Systems · Machine Learning Systems
Mingwei Sun
Wuhan University, Huawei Riemann Lab
Rüdiger Westermann
TU Munich
Konrad Schindler
Professor of Photogrammetry and Remote Sensing, ETH Zurich
Photogrammetry · Remote Sensing · Image Analysis · Computer Vision
Liqiu Meng
TU Munich