🤖 AI Summary
Existing methods struggle to jointly reconstruct dynamic humans, static scenes, and camera poses from multi-view videos, often resulting in geometric inconsistencies, unstable motion, and physically implausible trajectories. This work formulates joint human-scene-camera 4D reconstruction as a new task and proposes TROPHIES, a unified framework for it. TROPHIES uses a dual-branch architecture: one branch models dynamic humans through temporal and spatial reasoning, while the other reconstructs static scenes with human-aware attention. A global alignment and optimization module couples the two branches by enforcing scale consistency, contact priors, and cross-view temporal coherence, yielding physically plausible 4D reconstructions in a single global coordinate frame. On the EgoHuman and EgoExo4D datasets, TROPHIES consistently outperforms existing approaches in both global fidelity and human-scene consistency.
📝 Abstract
Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior work typically assumes single-view inputs or decouples humans, scenes, and cameras, and therefore cannot recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human-scene-camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES (Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos), a unified framework tailored for this task. TROPHIES features a Human Branch that models humans through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment and optimization module couples both branches by enforcing scale consistency, contact priors, and cross-view temporal coherence. Experiments on EgoHuman and EgoExo4D demonstrate that TROPHIES achieves globally aligned, physically plausible 4D reconstructions and consistently outperforms existing paradigms in both global fidelity and human-scene consistency.