SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis

📅 2026-01-12
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently generating semantically coherent and spatially accurate full 3D indoor scenes from natural language instructions. To this end, the authors propose a single-stage, non-autoregressive Transformer model that directly synthesizes scenes from text through parallel decoding and a fully discretized semantic-spatial representation. Key innovations include a dual masking strategy operating at both attribute and instance levels, as well as a learnable mapping mechanism that translates relational queries into symbolic triplets, significantly enhancing inter-object relationship modeling. Experiments on the 3D-FRONT dataset demonstrate that the proposed method outperforms existing autoregressive and diffusion-based approaches in both semantic plausibility and spatial layout accuracy, while substantially reducing computational overhead.
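As a rough illustration of what "parallel decoding" can look like in a masked non-autoregressive model, the sketch below starts from a fully masked token sequence and commits the most confident predictions over a fixed number of passes. The confidence-based commit rule and the cosine schedule are assumptions borrowed from the general masked-generative-modeling literature, not details confirmed by this summary. `parallel_decode`, `toy_predict`, and `MASK` are hypothetical names, and the toy predictor merely stands in for a trained Transformer.

```python
import math
import random

MASK = -1  # hypothetical id standing in for the [MASK] token


def parallel_decode(predict, length, steps):
    """Fill a fully masked sequence in at most `steps` parallel passes.

    `predict(tokens)` is assumed to return one (token, confidence) pair
    per position; a real model would produce these in a single forward pass.
    """
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked_pos = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked_pos:
            break
        preds = predict(tokens)
        # Assumed cosine schedule: how many positions stay masked after this pass.
        keep_masked = math.floor(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked_pos) - keep_masked)
        # Commit the most confident predictions among the still-masked positions.
        ranked = sorted(masked_pos, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


def toy_predict(tokens):
    """Stand-in predictor: token is i % 5, confidence is pseudo-random."""
    rng = random.Random(sum(1 for t in tokens if t == MASK))
    return [(i % 5, rng.random()) for i in range(len(tokens))]


decoded = parallel_decode(toy_predict, length=12, steps=4)
```

With `steps=4`, the whole sequence is filled in four predictor calls rather than one call per token, which is the efficiency argument made for non-autoregressive decoding.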

📝 Abstract
We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.
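The two masking levels described above can be sketched as follows, assuming a scene is a list of objects and each object is a short list of discrete attribute tokens (for example class, position bins, and an orientation bin). The token layout, the masking rates, and the names `dual_mask` and `MASK` are illustrative assumptions; the abstract does not specify how the two levels are sampled or combined.

```python
import random

MASK = -1  # hypothetical id standing in for the [MASK] token


def dual_mask(scene, attr_rate, inst_rate, rng):
    """Return a masked copy of `scene`.

    Instance level: with probability `inst_rate`, hide every attribute of
    an object, so it must be inferred from the other objects.
    Attribute level: otherwise, hide each attribute token independently
    with probability `attr_rate`, so it must be inferred from the rest of
    the same object and its context.
    """
    masked = []
    for obj in scene:
        if rng.random() < inst_rate:
            masked.append([MASK] * len(obj))
        else:
            masked.append(
                [MASK if rng.random() < attr_rate else tok for tok in obj]
            )
    return masked


# Three toy objects with four discrete attribute tokens each (made-up values).
scene = [[3, 12, 40, 7], [5, 30, 22, 1], [9, 8, 8, 4]]
masked_scene = dual_mask(scene, attr_rate=0.3, inst_rate=0.5, rng=random.Random(0))
```

During training, the model would be asked to reconstruct the original token at every masked position; the unmasked tokens are left untouched and serve as context.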
Problem

Research questions and friction points this paper is trying to address.

language-guided scene synthesis
3D indoor scene generation
semantic compliance
spatial arrangement
natural language instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked generative modeling
non-autoregressive Transformer
language-guided scene synthesis
triplet-based relational reasoning
discretized 3D representation
Jeongjun Choi
Seoul National University
Robotics, Deep Learning
Yeonsoo Park
Seoul National University
H. J. Kim
Seoul National University