AutoScape: Geometry-Consistent Long-Horizon Scene Generation

📅 2025-10-23
🤖 AI Summary
To address geometric inconsistency and poor temporal coherence in long-horizon autonomous driving video generation, this paper proposes a two-stage RGB-D diffusion framework. First, RGB and depth are jointly modeled in a shared latent space, with an explicit point-cloud representation enforcing scene geometry; a warp-consistent guidance mechanism steers sampling so that the sparse keyframes remain strictly geometrically consistent with one another. Second, these keyframes serve as anchors for a video diffusion model that interpolates dense intermediate frames. This is presented as the first method enabling end-to-end, geometrically consistent synthesis of long-duration (>20-second) driving videos. It achieves new state-of-the-art performance, improving long-horizon FID by 48.6% and FVD by 43.0% over prior methods, with markedly better visual realism and structural stability.

📝 Abstract
This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively.
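The paper's warp-consistent guidance is not spelled out in detail here, but the underlying idea of checking geometric agreement between RGB-D keyframes can be illustrated. The sketch below is a minimal, hypothetical NumPy implementation of a warp-consistency check: it lifts one keyframe's depth map to a point cloud, reprojects it into a second camera's view, and measures the photometric residual. All function names and the pinhole-projection setup are assumptions for illustration, not the authors' code.

```python
import numpy as np

def unproject(depth, K):
    """Lift a depth map (H, W) to camera-frame 3D points (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T        # normalized camera rays
    return rays * depth.reshape(-1, 1)     # scale each ray by its depth

def warp_consistency_residual(rgb1, depth1, rgb2, K, R, t):
    """Warp keyframe 1 into keyframe 2's view via its point cloud and
    return the mean absolute photometric residual over valid pixels.
    (R, t) is the relative pose taking camera-1 points into camera 2."""
    H, W, _ = rgb1.shape
    pts = unproject(depth1, K) @ R.T + t   # points in camera-2 frame
    proj = pts @ K.T                       # pinhole projection
    z = proj[:, 2]
    valid = z > 1e-6                       # keep points in front of camera
    uv = np.round(proj[:, :2] / np.where(valid, z, 1.0)[:, None]).astype(int)
    inside = (valid & (uv[:, 0] >= 0) & (uv[:, 0] < W)
                    & (uv[:, 1] >= 0) & (uv[:, 1] < H))
    src = rgb1.reshape(-1, 3)[inside]      # colors carried by the warp
    tgt = rgb2[uv[inside, 1], uv[inside, 0]]
    return np.abs(src - tgt).mean()
```

In a diffusion sampler, a residual like this could serve as a guidance energy whose gradient nudges each denoising step toward keyframes that agree with the already-generated geometry; with an identity relative pose and identical keyframes the residual is zero.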
Problem

Research questions and friction points this paper is trying to address.

Generating long-horizon, geometrically consistent driving scenes
Maintaining geometric consistency across sparse keyframes and dense videos
Improving realism and coherence in extended driving video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

RGB-D diffusion model generates sparse keyframes
Shared latent space handles image and depth
Video diffusion model interpolates dense video frames
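The control flow implied by these contributions can be sketched as a two-stage loop: iteratively sample sparse keyframes conditioned on the geometry accumulated so far, then interpolate dense frames between consecutive keyframes. The callables `keyframe_model` and `interp_model` below are hypothetical stand-ins for the RGB-D diffusion model and the video diffusion interpolator; this is a structural sketch, not the paper's implementation.

```python
def generate_long_horizon_video(n_keyframes, frames_per_gap,
                                keyframe_model, interp_model):
    """Two-stage long-horizon generation (structural sketch).

    Stage 1: sample sparse keyframes one at a time, each conditioned on
    the scene geometry (e.g., a rendered point cloud) accumulated from
    previously generated keyframes.
    Stage 2: densify by interpolating `frames_per_gap` frames between
    each pair of consecutive keyframes.
    """
    keyframes, scene_geometry = [], []
    for _ in range(n_keyframes):
        kf = keyframe_model(scene_geometry)  # warp-consistent sampling
        scene_geometry.append(kf)            # grow the anchor geometry
        keyframes.append(kf)

    video = [keyframes[0]]
    for a, b in zip(keyframes, keyframes[1:]):
        video.extend(interp_model(a, b, frames_per_gap))
        video.append(b)
    return video
```

With `n` keyframes and `k` interpolated frames per gap, the output has `1 + (n - 1) * (k + 1)` frames, which is how sparse anchors at ~20+ seconds of driving expand into a dense, coherent video.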
Authors

Jiacheng Chen (Simon Fraser University)
Ziyu Jiang (Texas A&M University)
Mingfu Liang (Meta | Northwestern University)
Bingbing Zhuang (NEC Labs America)
Jong-Chyi Su (NEC Labs America)
Sparsh Garg (NEC Labs America)
Ying Wu (Northwestern University)
M. Chandraker (NEC Labs America, UC San Diego)