Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reconstructing dynamic scenes from monocular video is often hindered by redundant Gaussians, view-dependent bias, and the difficulty of modeling global motion. This work proposes C4G, a framework that uses a compact set of timestamp-conditioned learnable Gaussian query tokens to aggregate features across the full temporal context and decode 3D Gaussians in a feed-forward manner, enabling 4D reconstruction without per-scene optimization. The method requires no camera pose inputs and shows stronger motion modeling and robustness to large temporal gaps. A video diffusion model-based enhancement module further improves rendering fidelity. C4G achieves strong novel-view synthesis with significantly fewer Gaussians and extends to a 4D feature field that supports point tracking and dynamic scene understanding.

📝 Abstract

Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians pixel-wise for each frame, suffering from duplicated Gaussians and view-dependent biases that hinder effective learning of scene motion. We present C4G, a feed-forward 4D reconstruction framework built upon a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates corresponding features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp, enabling globally coherent motion modeling without per-scene optimization. To capture fine-grained details, we further introduce a video diffusion model-based rendering enhancement module. Since our framework effectively aggregates features into Gaussians, we extend this capability to feature lifting, producing a 4D feature field that supports point tracking and dynamic scene understanding. C4G achieves strong novel-view synthesis performance using significantly fewer Gaussians and without requiring camera poses, while exhibiting stronger motion modeling and robustness to large temporal gaps.
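
The query-token idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the C4G implementation: the single cross-attention step, the sinusoidal time embedding, the "base position plus time-dependent offset" decoding, and every name and size (`decode_positions`, `w_time`, `w_xyz`, `w_motion`, `N`, `T`, `P`, `D`) are assumptions chosen only to show how a fixed set of queries can attend over all frames and yield one Gaussian center per query at a chosen timestamp.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_positions(queries, video_feats, t, params):
    """Hypothetical sketch: one 3D Gaussian center per query token.

    queries:     (N, D) learnable query tokens (random here)
    video_feats: (T, P, D) per-frame patch features
    t:           target timestamp in [0, 1]
    """
    T, P, D = video_feats.shape
    # Flatten all frames so each query sees the full temporal context.
    context = video_feats.reshape(T * P, D)
    attn = softmax(queries @ context.T / np.sqrt(D), axis=-1)  # (N, T*P)
    tokens = queries + attn @ context                           # (N, D)

    # Time-independent base position (an assumption of this sketch).
    base_xyz = tokens @ params["w_xyz"]                         # (N, 3)

    # Condition each token on the target timestamp (assumed embedding).
    time_emb = np.array([t, np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
    cond = np.tanh(tokens + time_emb @ params["w_time"])        # (N, D)
    offset = cond @ params["w_motion"]                          # (N, 3)
    return base_xyz + offset

rng = np.random.default_rng(0)
N, T, P, D = 8, 4, 16, 32          # illustrative sizes only
queries = rng.normal(size=(N, D))
video_feats = rng.normal(size=(T, P, D))
params = {
    "w_xyz": rng.normal(size=(D, 3)) * 0.1,
    "w_time": rng.normal(size=(3, D)),
    "w_motion": rng.normal(size=(D, 3)) * 0.1,
}

xyz_t0 = decode_positions(queries, video_feats, 0.0, params)
xyz_t1 = decode_positions(queries, video_feats, 0.5, params)
print("Gaussians from queries:", xyz_t0.shape[0])
print("Gaussians if predicted per pixel/patch per frame:", T * P)
```

In this sketch the number of Gaussians is fixed by the number of queries rather than by the number of frames or pixels, and the same queries produce different positions when the timestamp changes, which is the property the abstract contrasts with per-frame pixel-wise prediction.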

Problem

Research questions and friction points this paper is trying to address.

4D reconstruction

dynamic scene

monocular video

global motion

3D Gaussians

Innovation

Methods, ideas, or system contributions that make the work stand out.

compact Gaussians

timestamp-conditioned query tokens

feed-forward 4D reconstruction