Gaussian Sequences with Multi-Scale Dynamics for 4D Reconstruction from Monocular Casual Videos

📅 2026-02-14

🤖 AI Summary

This work addresses the challenge of achieving both high fidelity and global consistency in 4D dynamic scene reconstruction from monocular video, a task that is inherently ill-posed due to severe ambiguities. The authors propose a multi-scale dynamics decomposition framework that models complex motion fields across hierarchical levels, from objects down to particles, and introduce a dynamic 3D Gaussian sequence representation to capture fine-grained spatiotemporal structure. Complementary supervision from multi-modal priors provided by vision foundation models is integrated to reduce monocular ambiguity and encourage physically plausible dynamics. Experiments on dynamic novel-view synthesis show considerable improvements over existing approaches on both benchmark and real-world manipulation datasets, yielding high-fidelity, globally consistent 4D reconstructions.

📝 Abstract

Understanding dynamic scenes from casual videos is critical for scalable robot learning, yet four-dimensional (4D) reconstruction under strictly monocular settings remains highly ill-posed. Our key insight is that real-world dynamics exhibit a multi-scale regularity from the object level down to the particle level. Building on this insight, we design a multi-scale dynamics mechanism that factorizes complex motion fields. Within this formulation, we propose Gaussian sequences with multi-scale dynamics, a novel representation for dynamic 3D Gaussians derived through compositions of multi-level motion. This layered structure substantially alleviates reconstruction ambiguity and promotes physically plausible dynamics. We further incorporate multi-modal priors from vision foundation models to establish complementary supervision, constraining the solution space and improving reconstruction fidelity. Our approach enables accurate and globally consistent 4D reconstruction from monocular casual videos. Experiments on dynamic novel-view synthesis (NVS) with benchmark and real-world manipulation datasets demonstrate considerable improvements over existing methods.
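
The abstract does not spell out how the multi-level motion is composed, so the following is only a toy sketch of the general idea under stated assumptions: a coarse object-level rigid motion is applied to canonical Gaussian centers, and a small per-particle residual is added on top. The two-level split, the z-axis rotation parameterization, and every name here (`rotate_z`, `compose_motion`, `gaussian_sequence`) are hypothetical illustrations, not the paper's formulation.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def compose_motion(canonical_centers, object_pose, particle_offsets):
    """Compose coarse and fine motion for one frame.

    object_pose: (angle_about_z, translation), an assumed rigid
        parameterization shared by all Gaussians of one object.
    particle_offsets: one small 3D residual per Gaussian center
        (the assumed fine, particle-level motion).
    """
    angle, translation = object_pose
    moved = []
    for center, offset in zip(canonical_centers, particle_offsets):
        rx, ry, rz = rotate_z(center, angle)  # object-level motion
        moved.append((
            rx + translation[0] + offset[0],  # plus particle-level residual
            ry + translation[1] + offset[1],
            rz + translation[2] + offset[2],
        ))
    return moved


def gaussian_sequence(canonical_centers, object_poses, offsets_per_frame):
    """Return the Gaussian centers for every frame as a list of frames."""
    return [
        compose_motion(canonical_centers, pose, offsets)
        for pose, offsets in zip(object_poses, offsets_per_frame)
    ]


# Two canonical centers observed over two frames.
centers = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
poses = [(0.0, (0.0, 0.0, 0.0)), (math.pi / 2, (0.0, 0.0, 1.0))]
offsets = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)],
]
frames = gaussian_sequence(centers, poses, offsets)
```

In a factorization of this kind, the object-level pose has few parameters and is shared by many Gaussians, which is one plausible reason a layered structure constrains the solution space more than a free per-Gaussian trajectory would.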

Problem

Research questions and friction points this paper is trying to address.

4D reconstruction

monocular video

dynamic scenes

ill-posed problem

casual videos

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-scale dynamics

Gaussian sequences

4D reconstruction