🤖 AI Summary
Existing text-driven 3D indoor scene generation methods lack an intermediate representation that is both efficient and editable, which makes precise control over local geometry and semantics difficult. This work frames scene generation and editing as structured program generation and local program repair, and introduces Hierarchical Descriptive Scene Language (HDSL), an XML/CSS-style domain-specific language that represents complex scenes recursively as a tree with local coordinate systems. On top of this representation, the pipeline combines large language model (LLM) agents, Hierarchical Retrieval-Augmented Generation (HRAG), and a deterministic three-way merge to support generation and efficient local editing. In the authors' reproduced benchmark, the approach improves average object coverage and text-scene alignment over full text-to-scene baselines; in editing tasks, HRAG reduces token use by 5.22× and runtime by 6.19×, produces valid DSL for all eight paired edits, and better preserves unrelated objects.
📝 Abstract
Text-driven indoor scene generation and editing require an intermediate representation that language models can both produce and revise. Existing LLM-based systems often rely on scene graphs or global constraint lists, which are compact but underspecify local geometry and make instruction-based edits difficult to localize. We frame this problem as structured program generation and local program repair, and propose Hierarchical Descriptive Scene Language (HDSL), an XML/CSS-style domain-specific language for structured 3D indoor scenes. HDSL represents rooms, regions, objects, and support surfaces as a tree with local coordinates, making complex scenes easier to plan recursively and easier to retrieve for editing. Our pipeline uses LLM agents to generate HDSL subtrees with bounded verification, grounds non-virtual nodes through multimodal asset retrieval, and applies force-directed layout optimization to repair boundary and collision errors. For editing, Hierarchical Retrieval-Augmented Generation (HRAG) retrieves the relevant subtree, asks the LLM to rewrite only that local context, and merges the result back through a deterministic three-way merge. In our reproduced benchmark, HDSL improves average object coverage and text-scene alignment and reduces generation time relative to full text-to-scene baselines, while remaining competitive with recent layout-only reproductions on geometry metrics; for editing, HRAG reduces token use by $5.22\times$ and runtime by $6.19\times$, produces valid DSL for all eight paired edits, and better preserves unrelated scene objects.
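
To make the editing step more concrete, the following is a minimal sketch of a deterministic three-way merge over scene nodes. The abstract does not specify the merge rules, so everything here is an assumption for illustration: the scene is flattened to a map from node id to attributes, `base` is the retrieved subtree before the edit, `ours` is the current scene, `theirs` is the LLM rewrite, and conflicts are resolved in favor of the edit. The names `merge_node` and `three_way_merge` are hypothetical and not taken from the paper.

```python
from typing import Dict, Optional

Node = Dict[str, object]   # attribute name -> value (hypothetical schema)
Scene = Dict[str, Node]    # node id -> attributes

_MISSING = object()        # sentinel distinguishing "absent" from any real value


def merge_node(base: Optional[Node], ours: Optional[Node],
               theirs: Optional[Node]) -> Optional[Node]:
    """Three-way merge of a single node; None means the node does not exist."""
    if theirs == base:
        # The edit left this node untouched: keep the current scene's version.
        return ours
    if ours == base:
        # Only the edit changed it: accept the edit (this includes deletion).
        return theirs
    if base is None or ours is None or theirs is None:
        # Both sides diverged and one side added or removed the node.
        # Assumed policy: the edit wins, so the outcome stays deterministic.
        return theirs
    merged = dict(ours)
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        value = o if t == b else t   # take the edit only where it differs from base
        if value is _MISSING:
            merged.pop(key, None)
        else:
            merged[key] = value
    return merged


def three_way_merge(base: Scene, ours: Scene, theirs: Scene) -> Scene:
    """Merge an edited subtree back into the scene, node by node, in sorted id order."""
    result: Scene = {}
    for node_id in sorted(set(base) | set(ours) | set(theirs)):
        node = merge_node(base.get(node_id), ours.get(node_id), theirs.get(node_id))
        if node is not None:
            result[node_id] = node
    return result


# Toy example: the edit recolors the sofa, adds a rug, and deletes the lamp,
# while the current scene has a layout-adjusted sofa position and an unrelated table.
base = {"sofa": {"x": 1.0, "color": "grey"}, "lamp": {"x": 0.5}}
ours = {"sofa": {"x": 1.2, "color": "grey"}, "lamp": {"x": 0.5}, "table": {"x": 3.0}}
theirs = {"sofa": {"x": 1.0, "color": "blue"}, "rug": {"x": 1.0}}
merged = three_way_merge(base, ours, theirs)
```

In this toy case the merged scene keeps the adjusted sofa position from the current scene, takes the new color from the edit, adds the rug, removes the lamp, and leaves the unrelated table untouched, which is the kind of preservation of unrelated objects the abstract describes.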