🤖 AI Summary
Existing 3D indoor scene generation methods make notable progress on coarse-grained furniture layout but remain limited in modeling fine-grained object placement (position, orientation, and collision avoidance), which compromises scene realism and physical plausibility. To address this, we propose Hierarchical Layout Generation (HLG), the first top-down, coarse-to-fine hierarchical generation framework. HLG first decouples vertical (layered) and horizontal (planar) layout relationships, then introduces a trainable layout optimization network that jointly refines object poses under physical-feasibility constraints, effectively mitigating placement errors and inter-object penetration. Extensive experiments demonstrate state-of-the-art performance across multiple benchmarks, producing structurally coherent, geometrically detailed, and physically plausible 3D indoor scenes. The framework advances high-fidelity 3D environment synthesis for applications such as virtual reality and embodied AI.
📝 Abstract
Realistic 3D indoor scene generation is crucial for virtual reality, interior design, embodied intelligence, and scene understanding. While existing methods have made progress in coarse-scale furniture arrangement, they struggle to capture fine-grained object placements, limiting the realism and utility of generated environments. This gap hinders immersive virtual experiences and detailed scene comprehension for embodied AI applications. To address this gap, we propose Hierarchical Layout Generation (HLG), a novel method for fine-grained 3D scene generation. HLG is the first to adopt a coarse-to-fine hierarchical approach, refining scene layouts from large-scale furniture placement to intricate object arrangements. Specifically, our fine-grained layout alignment module constructs a hierarchical layout through vertical and horizontal decoupling, effectively decomposing complex 3D indoor scenes into multiple levels of granularity. Additionally, our trainable layout optimization network addresses placement issues such as incorrect positioning, orientation errors, and object intersections, ensuring structurally coherent and physically plausible scene generation. We demonstrate the effectiveness of our approach through extensive experiments, showing superior performance in generating realistic indoor scenes compared to existing methods. This work advances the field of scene generation and opens new possibilities for applications requiring detailed 3D environments. We will release our code upon publication to encourage future research.
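To make the "object intersections" issue concrete, the sketch below shows one standard way a layout optimizer can quantify inter-object penetration: summing pairwise overlap volumes of axis-aligned bounding boxes (AABBs) and driving that sum toward zero. This is a minimal illustration of the general idea only; the box representation and function names are assumptions, not the actual loss or network used by HLG.

```python
# Illustrative sketch (not the paper's method): a pairwise AABB overlap
# penalty that a layout optimizer could minimize to discourage object
# intersections. Each box is given as (min_xyz, max_xyz) corner tuples.

def overlap_1d(lo1, hi1, lo2, hi2):
    """Length of the overlap of two 1-D intervals (0 if disjoint)."""
    return max(0.0, min(hi1, hi2) - max(lo1, lo2))

def aabb_overlap_volume(box_a, box_b):
    """Overlap volume of two axis-aligned boxes; 0 if they do not intersect."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    vol = 1.0
    for axis in range(3):  # x, y, z
        vol *= overlap_1d(a_min[axis], a_max[axis], b_min[axis], b_max[axis])
    return vol

def penetration_penalty(boxes):
    """Scene-level penalty: sum of overlap volumes over all object pairs."""
    total = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            total += aabb_overlap_volume(boxes[i], boxes[j])
    return total
```

For example, two unit cubes shifted by half a unit along x overlap in a 0.5 x 1 x 1 slab, contributing 0.5 to the penalty; a fully separated object contributes nothing.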