MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

📅 2025-07-31
🤖 AI Summary
4D reconstruction of dynamic human activities (e.g., car repair, dancing) from extremely sparse views—such as four equidistant static monocular cameras—remains challenging, as conventional dense multi-view approaches fail due to insufficient viewpoint overlap. Method: We propose a reconstruction framework that enforces joint time and view consistency, fusing per-view monocular human reconstructions via pose alignment, temporal consistency optimization, and multi-view geometric constraints to enable collaborative modeling across views and frames. Contribution/Results: To our knowledge, this is the first method to achieve high-fidelity, temporally coherent 4D reconstruction under such ultra-sparse configurations—without depth sensors or dense sampling. Evaluated on the Panoptic Studio and Ego-Exo4D benchmarks, it significantly outperforms state-of-the-art methods: novel-view synthesis exhibits markedly improved quality, richer geometric detail, and more stable motion trajectories.
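The core fusion step above—aligning independent per-camera monocular reconstructions into one consistent world frame—can be illustrated with a standard similarity (Sim(3)) alignment. The sketch below is not the paper's implementation; it shows the classic Umeyama/Procrustes solve that recovers scale, rotation, and translation from corresponding 3D points (e.g., points on the reconstructed human shared between two views). The function name and the use of raw point correspondences are assumptions for illustration.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Estimate similarity transform (s, R, t) with s * R @ src_i + t ≈ dst_i.

    src, dst: (N, 3) arrays of corresponding 3-D points, e.g. matched points
    from two per-camera monocular reconstructions (hypothetical input here).
    Returns scale s, rotation R (3x3), translation t (3,).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d          # center both point sets
    cov = xd.T @ xs / len(src)               # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                         # avoid a reflection solution
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)       # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_s     # optimal isotropic scale
    t = mu_d - s * R @ mu_s
    return s, R, t
```

In practice each camera's monocular reconstruction has an arbitrary scale and pose, so a solve like this (followed by joint refinement under temporal and multi-view constraints, as the method describes) brings all four views into a shared coordinate system.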

📝 Abstract
We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g., Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g., four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on Panoptic Studio and Ego-Exo4D demonstrate that our method achieves higher-quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available at https://github.com/ImNotPrepared/MonoFusion.
Problem

Research questions and friction points this paper is trying to address.

Dynamic scene reconstruction from sparse-view videos
Overcoming limitations of dense multi-view setups
Aligning monocular reconstructions for view consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns monocular reconstructions for consistency
Uses sparse-view cameras for dynamic scenes
Improves novel view rendering quality