Predicting 3D representations for Dynamic Scenes

📅 2025-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing monocular dynamic video modeling methods produce only 2D frames and lack generalization to unseen scenes. Method: We propose the first egocentric, unbounded tri-plane representation for 4D scene modeling from monocular video. To address temporal coherence and joint geometric-semantic learning without explicit geometry supervision, we introduce a 4D-aware Transformer for self-supervised temporal feature aggregation, integrated with dynamic radiance field optimization over implicit tri-plane representations. Contribution/Results: Our approach achieves state-of-the-art performance on the NVIDIA Dynamic Scenes dataset, demonstrates strong cross-scene generalization, and, crucially, enables the first large-scale, self-supervised 4D reconstruction of the physical world directly from monocular video.

📝 Abstract
We present a novel framework for dynamic radiance field prediction from monocular video streams. Unlike previous methods that primarily focus on predicting future frames, our method goes a step further by generating explicit 3D representations of the dynamic scene. The framework builds on two core designs. First, we adopt an egocentric, unbounded tri-plane to explicitly represent the dynamic physical world. Second, we develop a 4D-aware Transformer that aggregates features from monocular videos to update the tri-plane. Coupling these two designs enables us to train the proposed model on large-scale monocular videos in a self-supervised manner. Our model achieves top results in dynamic radiance field prediction on the NVIDIA Dynamic Scenes dataset, demonstrating its strong performance on 4D physical world modeling. Moreover, our model shows superior generalizability to unseen scenarios. Notably, we find that our approach exhibits emergent capabilities for geometry and semantic learning.
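The abstract's first core design is a tri-plane scene representation: a 3D point is featurized by projecting it onto three axis-aligned feature planes (XY, XZ, YZ), bilinearly sampling each, and aggregating the results. The paper does not publish its implementation here; the sketch below shows only the generic tri-plane query step in NumPy, with plane resolutions, channel count, summation-based aggregation, and the function names `bilinear_sample` and `triplane_query` all being illustrative assumptions rather than the authors' code.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at normalized coords u, v in [0, 1]."""
    C, H, W = plane.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[:, y0, x0]
            + wx * (1 - wy) * plane[:, y0, x1]
            + (1 - wx) * wy * plane[:, y1, x0]
            + wx * wy * plane[:, y1, x1])

def triplane_query(planes, point):
    """Aggregate a 3D point's feature from the XY, XZ, and YZ planes by summation."""
    x, y, z = point
    return (bilinear_sample(planes["xy"], x, y)
            + bilinear_sample(planes["xz"], x, z)
            + bilinear_sample(planes["yz"], y, z))
```

In a full pipeline, the queried feature would be decoded (e.g. by a small MLP) into density and color for volume rendering, and the planes themselves would be the quantities updated by the 4D-aware Transformer as new frames arrive.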
Problem

Research questions and friction points this paper is trying to address.

Explicit 3D scene prediction from monocular video
Dynamic radiance field modeling
Joint geometry and semantic understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic radiance field prediction
4D-aware Transformer
Self-supervised training on monocular video
Di Qi (Purdue University, applied and computational mathematics)
Tong Yang (MEGVII Technology)
Beining Wang (Fudan University)
Xiangyu Zhang (MEGVII Technology)
Wenqiang Zhang (Fudan University)