🤖 AI Summary
This work addresses two problems: the depth ambiguity inherent in 4D reconstruction from a single moving camera, and the spatiotemporal inconsistency across views and time that arises in sparse dynamic multi-camera setups. It presents a 4D reconstruction framework tailored to sparse dynamic cameras. The approach initializes 3D tracks that are consistent across space and time by fusing inter-camera feature matching with intra-camera point tracking. It further stabilizes optimization and improves cross-view generalization with a noise-robust depth-ordering regularization loss and a spatiotemporally diverse batch sampling strategy. Key contributions are the framework, the consistency-aware initialization method, and LetCamsGo, a new real-world benchmark dataset for this setting. Experiments on LetCamsGo show that the method improves reconstruction quality in dynamic regions compared with baselines, supporting the feasibility of low-cost sparse dynamic camera configurations in real-world scenarios.
📝 Abstract
Although dynamic 3D (i.e., 4D) reconstruction from a monocular dynamic camera has recently advanced, it remains fundamentally limited by depth ambiguity. In this paper, we focus on a practical alternative: a sparse dynamic camera setup, in which a handful of independently moving cameras capture the same subjects. This setup keeps capture costs low, introduces multi-view constraints, and remains practical for real-world video production such as sports, concerts, and TV shows. Despite its potential, our experiments show that naive extensions of existing monocular or dense fixed-camera methods are insufficient, since they fail to resolve the complex spatiotemporal inconsistencies across views and time. To fill this gap, we propose a simple yet effective 3D track initialization method that ensures spatiotemporal consistency by integrating inter-camera feature matching with intra-camera point tracking. Additionally, we incorporate a noise-robust depth-ordering regularization loss and a spatiotemporally diverse batch sampling strategy to enhance optimization stability and cross-view generalization. Furthermore, to address the lack of standardized benchmarks for this task, we introduce LetCamsGo, a new real-world video dataset with 5 sequences across 4 diverse environments, recorded by three independently moving cameras and one fixed camera. Comprehensive benchmarking on LetCamsGo demonstrates that our proposed framework improves 4D reconstruction quality in dynamic regions compared with baselines, paving the way for a low-cost 4D reconstruction paradigm in the wild.
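The abstract does not spell out the form of the depth-ordering regularization loss. As a rough illustration of the general idea only, the sketch below implements a generic pairwise ordinal depth loss: it penalizes pixel pairs whose rendered depth order contradicts a noisy reference depth, and skips pairs whose reference depths are too close to rank reliably. The function names, the `margin` and `min_gap` parameters, and the hinge formulation are all assumptions made for illustration, not the paper's actual loss.

```python
import random


def depth_order_loss(rendered, reference, pairs, margin=0.0, min_gap=0.05):
    """Mean hinge penalty over pixel pairs whose rendered depth order
    disagrees with the reference depth order (hypothetical formulation)."""
    total, kept = 0.0, 0
    for i, j in pairs:
        gap = reference[i] - reference[j]
        # Skip pairs whose reference depths are nearly equal: their order
        # is unreliable when the reference depth is noisy.
        if abs(gap) < min_gap:
            continue
        sign = 1.0 if gap > 0 else -1.0
        # Zero penalty when the rendered difference has the same sign as the
        # reference difference (by at least `margin`); linear penalty otherwise.
        total += max(0.0, margin - sign * (rendered[i] - rendered[j]))
        kept += 1
    return total / kept if kept else 0.0


def sample_pairs(num_pixels, num_pairs, seed=0):
    """Draw random pairs of distinct pixel indices (hypothetical helper)."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(num_pixels), 2)) for _ in range(num_pairs)]
```

Because only the sign of the reference difference is used, a loss of this kind is insensitive to the unknown scale and shift of a monocular depth estimate, which is one common motivation for ordering-based regularizers over direct depth regression.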