🤖 AI Summary
Traditional 3D layout methods rely heavily on geometric constraints and hand-crafted rules, failing to model high-level semantics—such as social interactions, cultural norms, and usage habits—which results in poor generalization and frequent orientation errors. To address this, we propose the first four-tiered contextual framework (physical → functional → social → cultural) and a lightweight vision-language model (VLM)-driven closed-loop iterative layout paradigm. Our method automatically triggers diagnosis-and-correction cycles via minimal visual prompting, eliminating the need for manual hyperparameter tuning, large-scale annotations, or explicit rule encoding. Integrated multi-level contextual validation modules ensure rigorous adherence to spatial, orientational, and semantic constraints. Quantitative evaluation demonstrates substantial improvements over native VLM baselines in rotation accuracy, inter-object distance control, and overall layout plausibility. Notably, our approach achieves, for the first time, fully automated, regulation-compliant 3D scene composition across everyday settings as well as domain-specific contexts—including religious and ceremonial environments.
📝 Abstract
3D layout tasks have traditionally concentrated on geometric constraints, but many practical applications demand richer contextual understanding that spans social interactions, cultural traditions, and usage conventions. Existing methods often rely on rule-based heuristics or narrowly trained learning models, making them difficult to generalize and frequently prone to orientation errors that break realism. To address these challenges, we define four escalating context levels, ranging from straightforward physical placement to complex cultural requirements such as religious customs and advanced social norms. We then propose a Vision-Language Model-based pipeline that inserts minimal visual cues for orientation guidance and employs iterative feedback to pinpoint, diagnose, and correct unnatural placements in an automated fashion. Each adjustment is revisited through the system's verification process until it achieves a coherent result, thereby eliminating the need for extensive user oversight or manual parameter tuning. Our experiments across these four context levels reveal marked improvements in rotation accuracy, distance control, and overall layout plausibility compared with a native VLM baseline. By reducing the dependence on pre-programmed constraints or prohibitively large training sets, our method enables fully automated scene composition for both everyday scenarios and specialized cultural tasks, moving toward a universally adaptable framework for 3D arrangement.
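The closed-loop diagnose-and-correct paradigm described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `vlm_critique` is a hypothetical stand-in for the actual VLM call (here a toy critic that merely snaps object rotations to right angles, whereas the real model reasons over rendered visual prompts), and all names are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Placement:
    name: str
    position: tuple      # (x, y) on the floor plan
    rotation_deg: float  # facing direction in degrees

def vlm_critique(layout):
    """Hypothetical stand-in for the VLM diagnosis step: in the real
    pipeline, the scene is rendered with minimal visual cues (e.g.,
    orientation markers) and the model returns suggested corrections.
    This toy critic flags any object whose rotation is not a multiple
    of 90 degrees. An empty list means the layout is verified."""
    fixes = []
    for p in layout:
        snapped = round(p.rotation_deg / 90) * 90 % 360
        if abs(p.rotation_deg % 360 - snapped) > 1e-6:
            fixes.append((p.name, snapped))
    return fixes

def closed_loop_layout(layout, max_iters=5):
    """Iterative feedback loop: re-query the critic and apply its
    corrections until it verifies the layout or the budget runs out."""
    for _ in range(max_iters):
        fixes = vlm_critique(layout)
        if not fixes:
            break  # verification passed
        by_name = {p.name: p for p in layout}
        for name, rotation in fixes:
            by_name[name].rotation_deg = rotation
    return layout

scene = [Placement("sofa", (0, 0), 92.0), Placement("tv", (0, 3), 180.0)]
result = closed_loop_layout(scene)
```

After one correction cycle the slightly misrotated sofa is snapped to 90° and the loop terminates, mirroring how the pipeline revisits each adjustment until verification succeeds.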