🤖 AI Summary
This work addresses a limitation of existing RGB-based multi-view reconstruction methods: they produce monolithic scene representations that lack explicit physical structure, which hinders stable physical interaction. The authors propose GARDEN, a reconstruction framework that relies solely on RGB images and uses gravity as a universal physical prior. The method aligns the reconstruction to a gravity-consistent coordinate frame, reconstructs object-centric rigid-body meshes with 6-DoF placement, and applies conditional 3D point classification to remove duplicate object geometry from the background. This decouples foreground objects from background geometry without requiring CAD model retrieval. The output is a structured hybrid representation suitable for direct physics simulation. Experiments on both simulated and real-world multi-view scenes report improvements over retrieval-based baselines in object placement reliability, decoupling quality, and rendering-simulation efficiency.
📝 Abstract
Converting multi-view RGB observations into simulation-ready 3D environments remains challenging because current reconstruction pipelines produce monolithic scene representations without explicit physical structure. These representations are typically defined only up to an arbitrary global rotation, and they entangle rigid foreground objects with background geometry, which hinders stable physical interaction. Existing solutions often recover interactivity by replacing reconstructed objects with retrieved CAD assets, but this introduces a slow retrieval-and-replacement stage and weakens scene-specific geometric fidelity. We propose GARDEN, an RGB-only framework that reformulates reconstruction as physically grounded scene factorization and outputs a structured hybrid scene representation. The key idea is to use gravity as a universal physical prior: we first align the reconstruction to a unified Gravity-View frame to resolve gauge ambiguity, then recover object-centric rigid meshes with accurate 6-DoF placement, and finally remove duplicate object geometry from the background through conditional 3D point classification. The resulting representation combines explicit rigid bodies with a decoupled background, enabling direct physics simulation while preserving visual realism. Experiments on both simulated and real multi-view scenes show that GARDEN improves object placement reliability, disentanglement quality, and rendering-simulation efficiency compared with retrieval-based baselines.
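To make the gravity-alignment step concrete, the sketch below shows one generic way to rotate a reconstruction so that an estimated gravity direction points along a fixed "down" axis. It is not GARDEN's implementation: the abstract does not say how gravity is estimated or how the Gravity-View frame is defined. The function names `rotation_aligning` and `rotate`, the choice of -z as "down", and the example gravity vector are all assumptions made for illustration. The sketch uses the standard Rodrigues rotation formula and leaves the rotation about the gravity axis (yaw) unconstrained.

```python
import math


def _normalize(v):
    """Return v scaled to unit length; reject zero-length input."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return [c / n for c in v]


def _cross(a, b):
    """Cross product of two 3-vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_aligning(gravity, down=(0.0, 0.0, -1.0)):
    """Return a 3x3 rotation (nested lists) mapping `gravity` onto `down`.

    Hypothetical helper: `down` = -z is an assumed convention, not the paper's.
    """
    a = _normalize(gravity)
    b = _normalize(down)
    v = _cross(a, b)                       # rotation axis scaled by sin(angle)
    c = sum(x * y for x, y in zip(a, b))   # cos(angle)
    s2 = sum(x * x for x in v)             # sin(angle) squared
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

    if s2 < 1e-24:
        if c > 0.0:
            return eye  # already aligned
        # Antiparallel case: rotate 180 degrees about any axis perpendicular to a.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        u = _normalize(_cross(a, helper))
        return [[2.0 * u[i] * u[j] - eye[i][j] for j in range(3)] for i in range(3)]

    # Rodrigues formula: R = I + K + K^2 * (1 - cos) / sin^2, with K = [v]_x.
    K = [
        [0.0, -v[2], v[1]],
        [v[2], 0.0, -v[0]],
        [-v[1], v[0], 0.0],
    ]
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)] for i in range(3)]
    k = (1.0 - c) / s2
    return [[eye[i][j] + K[i][j] + k * K2[i][j] for j in range(3)] for i in range(3)]


def rotate(R, p):
    """Apply rotation matrix R to a single 3D point or direction p."""
    return [sum(R[i][j] * p[j] for j in range(3)) for i in range(3)]


# Example: an assumed gravity estimate along -y in the reconstruction's frame.
R = rotation_aligning((0.0, -1.0, 0.0))
aligned_gravity = rotate(R, (0.0, -1.0, 0.0))
```

Applying `rotate(R, p)` to every reconstructed point and camera center would express the scene in a frame where the ground plane is horizontal. Because any further rotation about the gravity axis also satisfies the constraint, a complete gauge fix needs an additional reference, such as a chosen view direction; how GARDEN selects it is not described in the abstract.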