🤖 AI Summary
This work addresses the challenge of generating dense, irregular, interactive 3D tabletop scenes while avoiding the collisions and other physical violations common in existing methods. The authors propose PhyScene3D, a human-like construction paradigm that uses a Cognitive Topological Reasoning Chain (CTRC) to synthesize scenes sequentially, with each placement guided by an anchor. To enhance physical plausibility, they introduce a Physics-Aware Denoising Alignment (PADA) mechanism. By integrating 3D axis-aligned bounding box (AABB) layouts, differentiable signed distance fields, test-time optimization, and end-to-end training, the method preserves semantic intent while improving physical consistency. Experiments show a 40% reduction in scene-level collision rate relative to the human-annotated training data, with both semantic accuracy and physical validity surpassing current state-of-the-art approaches.
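To make the scene-level collision rate concrete, here is a minimal sketch of how such a metric could be computed over AABB layouts. It assumes a scene counts as colliding when any pair of boxes interpenetrates; the layout format and the function names are hypothetical, and the authors' exact metric definition may differ.

```python
from itertools import combinations

# Hypothetical layout format: a box is (min_corner, max_corner),
# each corner an (x, y, z) tuple; a scene is a list of boxes.

def boxes_overlap(a, b, tol=0.0):
    """True if two AABBs interpenetrate by more than tol on every axis."""
    (amin, amax), (bmin, bmax) = a, b
    return all(
        min(amax[k], bmax[k]) - max(amin[k], bmin[k]) > tol for k in range(3)
    )

def scene_has_collision(boxes, tol=0.0):
    """True if any pair of boxes in the scene interpenetrates."""
    return any(boxes_overlap(a, b, tol) for a, b in combinations(boxes, 2))

def scene_collision_rate(scenes, tol=0.0):
    """Fraction of scenes containing at least one colliding pair."""
    if not scenes:
        return 0.0
    return sum(scene_has_collision(s, tol) for s in scenes) / len(scenes)
```

Boxes that merely touch along a face are not counted as colliding here, because the overlap on the contact axis is exactly zero.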
📝 Abstract
Generating physically consistent 3D tabletop scenes is a fundamental yet underexplored problem for interactive and generalist robotic learning. The challenge stems from dense object hierarchies and irregular affordances. Here, an interactive scene denotes a physically valid, collision-free environment that can be loaded directly into physics simulators. Existing methods, ranging from decoupled symbolic solvers to end-to-end regression models, often suffer from error propagation or from overfitting to noisy supervision that contains widespread physical violations. To address these limitations, we introduce PhyScene3D, a framework that reformulates generation as a Human-Mimetic Constructive Process. The proposed Cognitive Topological Reasoning Chain (CTRC) factorizes scene synthesis into a sequential, anchor-conditioned process. CTRC employs a 3D AABB-based placement scheme that imposes a strong structural inductive bias. To address imperfect supervision and physical infeasibility, we introduce Physics-Aware Denoising Alignment (PADA). PADA integrates a differentiable Signed Distance Field (SDF) with Test-Time Optimization (TTO) to project generated scenes onto a physics-feasible manifold while preserving semantic intent. Experiments demonstrate that PhyScene3D outperforms state-of-the-art approaches in both semantic accuracy and physical validity, achieving a 40% reduction in scene-wise collision rate relative to the human-annotated training data.
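The idea of projecting a generated layout onto a collision-free configuration at test time can be illustrated with a small sketch. This is not the PADA implementation: it replaces the differentiable SDF with an analytic AABB penetration depth, uses a hand-derived gradient, and moves objects only in the tabletop plane. The function names, the `margin` and `lr` parameters, and the center/half-size box format are all assumptions made for illustration.

```python
from itertools import combinations

def collides(ca, ha, cb, hb):
    """Strict AABB test from centers and half-sizes: overlap on all 3 axes."""
    return all(ha[k] + hb[k] - abs(ca[k] - cb[k]) > 0 for k in range(3))

def refine_layout(centers, halves, lr=0.25, margin=1e-3, max_steps=200):
    """Gradient descent on the penalty 0.5 * d**2 per overlapping pair,
    where d is the margin-inflated overlap along the shallower of the
    x and y axes. Objects move only in the table plane (x, y)."""
    centers = [list(c) for c in centers]  # work on a copy
    for _ in range(max_steps):
        grads = [[0.0, 0.0, 0.0] for _ in centers]
        active = False
        for i, j in combinations(range(len(centers)), 2):
            ci, cj = centers[i], centers[j]
            # Per-axis overlap, inflated by a small safety margin.
            o = [halves[i][k] + halves[j][k] + margin - abs(ci[k] - cj[k])
                 for k in range(3)]
            if min(o) <= margin / 2:
                continue  # separated on some axis: no penalty
            active = True
            k = 0 if o[0] <= o[1] else 1      # push along the shallower axis
            s = 1.0 if ci[k] - cj[k] >= 0 else -1.0
            # d(0.5 * d**2)/d(ci[k]) = -d * s, and the opposite sign for cj.
            grads[i][k] += -o[k] * s
            grads[j][k] += o[k] * s
        if not active:
            break
        for c, g in zip(centers, grads):
            for k in range(2):
                c[k] -= lr * g[k]
    return centers
```

Because the optimization starts from the generated layout and only penalized pairs receive a gradient, objects that are already collision-free stay where they were placed, which is one simple way to keep the refined scene close to the original arrangement.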