🤖 AI Summary
Existing monocular dynamic video modeling methods produce only 2D frames and generalize poorly to unseen scenes. Method: We propose the first egocentric, unbounded tri-plane representation for 4D scene modeling from monocular video. To achieve temporal coherence and joint geometric-semantic learning without explicit geometry supervision, we introduce a 4D-aware Transformer for self-supervised temporal feature aggregation, integrated with dynamic radiance field optimization over the implicit tri-plane representation. Contribution/Results: Our approach achieves state-of-the-art performance on the NVIDIA Dynamic Scenes dataset, demonstrates strong cross-scene generalization, and, crucially, enables large-scale, self-supervised 4D reconstruction of the physical world directly from monocular video, marking the first such capability.
📝 Abstract
We present a novel framework for dynamic radiance field prediction from monocular video streams. Unlike previous methods that primarily focus on predicting future frames, our method goes a step further by generating explicit 3D representations of the dynamic scene. The framework builds on two core designs. First, we adopt an ego-centric unbounded triplane to explicitly represent the dynamic physical world. Second, we develop a 4D-aware transformer that aggregates features from monocular videos to update the triplane. Coupling these two designs enables us to train the proposed model on large-scale monocular videos in a self-supervised manner. Our model achieves top results in dynamic radiance field prediction on the NVIDIA Dynamic Scenes dataset, demonstrating its strong performance on 4D physical world modeling. Moreover, our model shows superior generalizability to unseen scenarios. Notably, we find that our approach exhibits emergent capabilities for geometry and semantic learning.
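To make the two core designs more concrete, below is a minimal PyTorch sketch, not the released implementation: a tri-plane field queried through an unbounded contraction, plus a transformer block that aggregates time-stamped per-frame tokens into a tri-plane update. The module names, feature sizes, contraction function, and decoder head are all assumptions for illustration; the paper's actual architecture may differ.

```python
# Hypothetical sketch of an ego-centric unbounded triplane + 4D-aware aggregation.
# Everything here (contraction, dims, attention layout) is an assumption, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def contract_unbounded(xyz: torch.Tensor) -> torch.Tensor:
    """Map unbounded ego-centric coordinates into [-1, 1] (a mip-NeRF-360-style
    contraction is one plausible choice; the paper's exact mapping may differ)."""
    norm = xyz.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    contracted = torch.where(norm <= 1, xyz, (2 - 1 / norm) * xyz / norm)
    return contracted / 2  # keep coordinates inside [-1, 1] for grid_sample


class TriplaneField(nn.Module):
    """Three axis-aligned feature planes (XY, XZ, YZ) queried by 3D points."""

    def __init__(self, res: int = 128, dim: int = 32):
        super().__init__()
        self.planes = nn.Parameter(torch.zeros(3, dim, res, res))
        self.decoder = nn.Sequential(nn.Linear(3 * dim, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        x = contract_unbounded(xyz)                        # (N, 3) in [-1, 1]
        coords = [x[..., [0, 1]], x[..., [0, 2]], x[..., [1, 2]]]
        feats = []
        for plane, uv in zip(self.planes, coords):
            grid = uv.view(1, -1, 1, 2)                    # (1, N, 1, 2)
            f = F.grid_sample(plane[None], grid, align_corners=True)
            feats.append(f.view(plane.shape[0], -1).t())   # (N, dim)
        return self.decoder(torch.cat(feats, dim=-1))      # (N, 4): density + RGB


class FourDAwareAggregator(nn.Module):
    """Transformer block fusing per-frame image tokens (with time embeddings)
    into an additive update for the flattened triplane tokens."""

    def __init__(self, dim: int = 32, heads: int = 4):
        super().__init__()
        self.time_embed = nn.Linear(1, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dim = dim

    def forward(self, planes: torch.Tensor, frame_tokens: torch.Tensor,
                times: torch.Tensor) -> torch.Tensor:
        # planes: (3, dim, res, res); frame_tokens: (T, L, dim); times: (T,)
        queries = planes.flatten(2).permute(0, 2, 1).reshape(1, -1, self.dim)
        keys = (frame_tokens + self.time_embed(times[:, None, None])).reshape(1, -1, self.dim)
        update, _ = self.attn(queries, keys, keys)
        update = update.reshape(3, -1, self.dim).permute(0, 2, 1).reshape_as(planes)
        return planes + update
```

In this reading, flattening the three planes into query tokens lets every spatial cell attend to tokens from all input frames at once, which is one way the self-supervised temporal aggregation described above could be realized; the updated planes would then be queried along camera rays and supervised purely by photometric rendering loss.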