🤖 AI Summary
Dynamic scene reconstruction from monocular video is often hindered by duplicated Gaussians, view-dependent bias, and difficulty modeling global motion. This work proposes C4G, a feed-forward framework built on a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates features across the full temporal context and decodes a 3D Gaussian, enabling 4D reconstruction without per-scene optimization. The method requires no camera pose inputs and shows stronger motion modeling and robustness to large temporal gaps. A rendering enhancement module based on a video diffusion model recovers fine-grained details. C4G achieves strong novel-view synthesis with significantly fewer Gaussians, and extends to a 4D feature field that supports point tracking and dynamic scene understanding.
📝 Abstract
Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians pixel-wise for each frame, suffering from duplicated Gaussians and view-dependent biases that hinder effective learning of scene motion. We present C4G, a feed-forward 4D reconstruction framework built upon a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates corresponding features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp, enabling globally coherent motion modeling without per-scene optimization. To capture fine-grained details, we further introduce a video diffusion model-based rendering enhancement module. Since our framework effectively aggregates features into Gaussians, we extend this capability to feature lifting, producing a 4D feature field that supports point tracking and dynamic scene understanding. C4G achieves strong novel-view synthesis performance using significantly fewer Gaussians and without requiring camera poses, while exhibiting stronger motion modeling and robustness to large temporal gaps.
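To make the token-based decoding described above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single cross-attention step, linear decoding heads, and a simple timestamp embedding; the names `init_params`, `decode_gaussians`, `W_xyz`, `W_motion`, and `W_opacity` are hypothetical and do not come from the paper. It illustrates only two properties stated in the abstract: the number of Gaussians is fixed by the number of query tokens rather than by pixels or frames, and the timestamp modulates each Gaussian's position.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(num_tokens, dim, seed=0):
    """Randomly initialise hypothetical query tokens and linear heads."""
    rng = np.random.default_rng(seed)
    s = 1.0 / np.sqrt(dim)
    return {
        "queries": rng.normal(0.0, 1.0, (num_tokens, dim)),  # learnable tokens
        "W_xyz": rng.normal(0.0, s, (dim, 3)),                # canonical position head
        "W_motion": rng.normal(0.0, s, (dim + 3, 3)),         # timestamp-conditioned offset head
        "W_opacity": rng.normal(0.0, s, (dim, 1)),            # opacity head
    }


def decode_gaussians(params, frame_feats, t):
    """Decode one Gaussian per query token at target timestamp t.

    frame_feats: (T, P, D) array of per-frame features (T frames, P patches).
    t: scalar target timestamp, assumed to be normalised to [0, 1].
    """
    T, P, D = frame_feats.shape
    context = frame_feats.reshape(T * P, D)  # full temporal context

    # Cross-attention: each token aggregates features from every frame.
    q = params["queries"]
    attn = softmax(q @ context.T / np.sqrt(D), axis=-1)  # (N, T*P)
    tokens = q + attn @ context                          # (N, D)

    # Time-independent attributes are decoded from the token alone.
    xyz_canonical = tokens @ params["W_xyz"]
    opacity = 1.0 / (1.0 + np.exp(-(tokens @ params["W_opacity"])))

    # Assumed timestamp embedding; the position offset depends on token and t.
    t_emb = np.array([t, np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
    cond = np.concatenate(
        [tokens, np.broadcast_to(t_emb, (tokens.shape[0], 3))], axis=1
    )
    offset = cond @ params["W_motion"]

    return {"xyz": xyz_canonical + offset, "opacity": opacity}


if __name__ == "__main__":
    params = init_params(num_tokens=16, dim=8)
    feats = np.random.default_rng(1).normal(size=(5, 10, 8))
    g = decode_gaussians(params, feats, t=0.5)
    print(g["xyz"].shape, g["opacity"].shape)
```

In this sketch, doubling the number of input frames leaves the Gaussian count unchanged, which contrasts with pixel-wise per-frame prediction, where the count grows with the input.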