Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing LVLM-based methods for text-to-3D indoor scene generation rely on irreversible sequential decision-making, which is prone to error accumulation. This work reframes the task as a planning problem constrained by spatial and layout commonsense and introduces a global-local Monte Carlo Tree Search (MCTS) framework. The global tree places objects one at a time and can revisit earlier attempts, while the local tree decomposes each placement into finer sub-steps. The search relies on a hierarchical scene representation (room, region, floor object, and supported object levels) and a PRM-guided MCTS that prunes unnecessary branches. Textures for all objects are predicted by pretrained diffusion image generative models. Evaluated on 3DTindo-bench, a newly collected dataset with 65 scene types and 3,250 instructions, the method generates more realistic 3D scenes than state-of-the-art approaches.
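
The four-level hierarchy in the summary can be pictured as a nested data structure. The sketch below is only an illustration under assumptions: the class names, fields, and the `count_objects` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SupportedObject:
    """Object resting on a floor object (hypothetical class), e.g. a lamp."""
    name: str


@dataclass
class FloorObject:
    """Object standing on the floor (hypothetical class), e.g. a nightstand."""
    name: str
    supported: List[SupportedObject] = field(default_factory=list)


@dataclass
class Region:
    """Functional area inside a room (hypothetical class)."""
    name: str
    floor_objects: List[FloorObject] = field(default_factory=list)


@dataclass
class Room:
    """Top level of the hierarchy (hypothetical class)."""
    name: str
    regions: List[Region] = field(default_factory=list)


def count_objects(room: Room) -> int:
    """Count every floor object plus the objects it supports."""
    return sum(
        1 + len(floor_obj.supported)
        for region in room.regions
        for floor_obj in region.floor_objects
    )


bedroom = Room(
    "bedroom",
    [
        Region(
            "sleeping area",
            [
                FloorObject("bed"),
                FloorObject("nightstand", [SupportedObject("lamp")]),
            ],
        )
    ],
)
print(count_objects(bedroom))
```

Nesting the levels this way lets a planner decide coarse structure (regions) before committing to individual placements, which is one plausible reading of why the abstraction helps the search.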

📝 Abstract

Large Vision-Language Models (LVLMs) have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error propagation. In this paper, we consider the task as a planning problem constrained by spatial and layout commonsense. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making approaches. In the global tree, we place each object iteratively and explore multiple attempts, as humans do when furnishing a room, with the problem space represented as a tree. To search the tree effectively, we propose a hierarchical scene representation and a PRM-guided MCTS method. The hierarchical representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation, reaching an optimal solution with fewer attempts. The local tree further decomposes the placement of each object into finer sub-steps, including the specific placement parameters. To make the overall appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. As existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale, diverse dataset named 3DTindo-bench, which contains 65 scene types and 3,250 instructions with diverse sizes, layouts, and styles, to better assess the capability of state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art approaches.
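
To make the idea of score-guided pruning inside MCTS concrete, here is a minimal sketch on a toy problem: three objects are placed in turn into five slots along a wall. It is not the authors' implementation. The scoring function `toy_prm` is a hand-written stand-in for the PRM, and the slot layout, the `PRUNE_BELOW` threshold, and all names are assumptions made for illustration. The sketch covers a single tree only and does not model the local tree.

```python
import math
import random

OBJECTS = ["bed", "nightstand", "wardrobe"]  # placed in this order (assumption)
SLOTS = range(5)                             # toy 1D wall with five slots
PRUNE_BELOW = 0.25                           # hypothetical pruning threshold


def toy_prm(state):
    """Hand-written stand-in for the PRM: score a partial layout in [0, 1]."""
    if len(set(state)) < len(state):
        return 0.0                           # two objects share a slot
    score = 1.0
    if len(state) >= 2 and abs(state[0] - state[1]) != 1:
        score -= 0.5                         # nightstand should touch the bed
    if len(state) >= 3 and abs(state[0] - state[2]) < 2:
        score -= 0.3                         # wardrobe should keep clear of the bed
    return score


def allowed(state):
    """Slots for the next object that survive pruning by the toy scorer."""
    if len(state) == len(OBJECTS):
        return []
    return [s for s in SLOTS if toy_prm(state + (s,)) >= PRUNE_BELOW]


class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.untried = [], allowed(state)
        self.visits, self.value = 0, 0.0


def uct(child, c=1.4):
    """UCB1 score: average reward plus an exploration bonus."""
    bonus = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return child.value / child.visits + bonus


def rollout(state, rng):
    """Finish the layout with random unpruned placements, then score it."""
    while len(state) < len(OBJECTS):
        options = allowed(state)
        if not options:
            return 0.0
        state = state + (rng.choice(options),)
    return toy_prm(state)


def search(iterations=500, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        while not node.untried and node.children:        # selection
            node = max(node.children, key=uct)
        if node.untried:                                 # expansion
            slot = node.untried.pop(rng.randrange(len(node.untried)))
            node.children.append(Node(node.state + (slot,), parent=node))
            node = node.children[-1]
        reward = rollout(node.state, rng)                # simulation
        while node is not None:                          # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    node = root
    while node.children:                                 # follow most-visited path
        node = max(node.children, key=lambda n: n.visits)
    return dict(zip(OBJECTS, node.state))


layout = search()
print(layout)
```

Because pruned placements never become tree nodes, the search budget is spent only on layouts the scorer considers plausible. Unlike a single sequential pass, a poor early placement is not final: later iterations can descend through a different child of the same parent.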

Problem

Research questions and friction points this paper is trying to address.

text-to-3D

indoor scene generation

error propagation

spatial commonsense

sequential decision-making

Innovation

Methods, ideas, or system contributions that make the work stand out.

Global-Local Monte Carlo Tree Search

Vision-Language Models

Text-to-3D Generation