SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches for generating high-fidelity, physically plausible, and semantically flexible indoor 3D scenes for embodied AI are limited by fixed object categories, insufficient geometric and textural detail, physical inconsistencies, and weak alignment with natural-language instructions. To address these challenges, we propose a reflective agent architecture that unifies data-driven and language-guided generation paradigms—introducing the first framework to integrate a large language model–based planner, a multimodal vision generator, a physics simulation–based verification module, and a self-feedback closed-loop system. Our method enables open-vocabulary category extension and fine-grained semantic refinement, driven by iterative, self-assessed optimization. Experiments demonstrate significant improvements over state-of-the-art methods across both common and open-vocabulary room types, achieving concurrent gains in physical plausibility, visual fidelity, and instruction alignment. The framework exhibits strong generalization capability and practical potential for real-world embodied applications.
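The closed loop described above can be pictured in a few lines of code. The following is a minimal, hypothetical Python sketch of a reason-act-reflect iteration; all names (`Evaluation`, `synthesize`, the `planner` and `tools` interfaces) are illustrative stand-ins, not SceneWeaver's actual API.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    physical: float   # physics-simulation plausibility score
    visual: float     # visual-realism score
    semantic: float   # alignment with the user instruction

    def passed(self, threshold: float = 0.8) -> bool:
        return min(self.physical, self.visual, self.semantic) >= threshold


def synthesize(instruction, planner, tools, max_iters=5):
    """Iteratively refine a scene until self-evaluation passes all axes."""
    scene, feedback = None, ""
    for _ in range(max_iters):
        # Reason: the LLM planner picks a tool and its arguments from the
        # instruction, the current scene state, and prior feedback.
        tool_name, args = planner.plan(instruction, scene, feedback)
        # Act: invoke the selected generation or refinement tool.
        scene = tools[tool_name](scene, **args)
        # Reflect: score the result on the three self-evaluation axes.
        evaluation = planner.evaluate(instruction, scene)
        if evaluation.passed():
            break
        feedback = planner.describe_issues(evaluation, scene)
    return scene
```

The key design point the paper emphasizes is that evaluation drives planning: the loop terminates only when physical, visual, and semantic scores all pass, so failed axes become targeted feedback for the next tool call.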

📝 Abstract
Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation. Project website: https://scene-weaver.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Synthesizing 3D scenes that are visually realistic and physically plausible
Overcoming constraints of fixed scene categories and limited object detail
Aligning complex user instructions with functional and diverse environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic framework with tool-based iterative refinement
Language model planner selecting from an extensible suite of generation tools (sketched below)
Closed-loop reason-act-reflect design for self-correction
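The "extensible tools" idea can be illustrated with a registry pattern. Below is a hypothetical sketch, assuming a decorator-based registry; the tool names, descriptions, and signatures are invented for illustration and do not reflect SceneWeaver's actual implementation.

```python
from typing import Callable, Dict

# Registry mapping tool names to callables plus natural-language
# descriptions the planner can read when choosing its next action.
TOOLS: Dict[str, dict] = {}


def register_tool(name: str, description: str):
    """Decorator exposing a scene-generation tool to the planner."""
    def wrap(fn: Callable) -> Callable:
        TOOLS[name] = {"fn": fn, "description": description}
        return fn
    return wrap


@register_tool("layout_generator", "Data-driven model that drafts a room layout")
def layout_generator(scene, room_type: str):
    ...  # placeholder: call a learned layout model


@register_tool("object_refiner", "LLM-guided refinement of object placement")
def object_refiner(scene, target: str, instruction: str):
    ...  # placeholder: query an LLM for a revised placement
```

Because new backends (data-driven, visual, or LLM-based) only need to register a name and description, the planner's action space grows without changes to the core loop.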
Yandan Yang
BIGAI (Beijing Institute for General Artificial Intelligence)
Computer Vision · Generation · Embodied AI
Baoxiong Jia
Ph.D. in Computer Science, UCLA
Computer Vision · Artificial Intelligence
Shujie Zhang
State Key Laboratory of General Artificial Intelligence, BIGAI, Tsinghua University
Siyuan Huang
State Key Laboratory of General Artificial Intelligence, BIGAI