🤖 AI Summary
This work addresses the challenge of enabling robots to interpret under-specified high-level instructions (e.g., “set the table for two”) and generate functionally valid object arrangements. We propose a few-shot scene layout generation method that uses abstract spatial relation graphs, proposed by large language models (LLMs), as structured geometric constraints. These graphs are grounded by composing a library of diffusion models to solve for physically realizable object poses, while program synthesis and constraint satisfaction jointly target functional validity, physical stability, and visual aesthetics. Evaluated on study desk, dining table, and coffee table scenes, our approach outperforms existing baselines. It produces semantically correct, stable, and aesthetically coherent layouts from only a small number of demonstration examples and a lightweight, human-written program sketch, without requiring extensive training data or manual rule engineering.
📝 Abstract
This paper studies the challenge of developing robots capable of understanding under-specified instructions for creating functional object arrangements, such as "set up a dining table for two"; previous arrangement approaches have focused on much more explicit instructions, such as "put object A on the table." We introduce a framework, SetItUp, for learning to interpret under-specified instructions. SetItUp takes a small number of training examples and a human-crafted program sketch to uncover arrangement rules for specific scene types. By leveraging an intermediate graph-like representation of abstract spatial relationships among objects, SetItUp decomposes the arrangement problem into two subproblems: i) learning the arrangement patterns from limited data and ii) grounding these abstract relationships into object poses. SetItUp leverages large language models (LLMs) to propose the abstract spatial relationships among objects in novel scenes as the constraints to be satisfied; then, it composes a library of diffusion models associated with these abstract relationships to find object poses that satisfy the constraints. We validate our framework on a dataset comprising study desks, dining tables, and coffee tables, with the results showing superior performance in generating physically plausible, functional, and aesthetically pleasing object arrangements compared to existing models.
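The two-stage decomposition described in the abstract can be illustrated with a toy sketch. This is not SetItUp's implementation: the relation names (`left_of`, `right_of`), the `Relation` class, and the hand-written cost functions are all hypothetical, and plain gradient descent on summed costs stands in for the composed diffusion models. The relation list is hard-coded here where the framework would obtain it from an LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    kind: str  # hypothetical relation name, e.g. "left_of"
    a: str     # subject object
    b: str     # reference object


def cost_left_of(pa, pb, margin=0.10):
    """Zero when a is at least `margin` to the left of b and level with it."""
    return max(0.0, pa[0] - pb[0] + margin) ** 2 + (pa[1] - pb[1]) ** 2


# One cost per abstract relation; a stand-in for one model per relation.
COSTS = {
    "left_of": cost_left_of,
    "right_of": lambda pa, pb: cost_left_of(pb, pa),
}


def total_cost(poses, relations):
    """Compose the per-relation costs by summing them over the graph edges."""
    return sum(COSTS[r.kind](poses[r.a], poses[r.b]) for r in relations)


def ground(poses, relations, steps=500, lr=0.1, eps=1e-5):
    """Gradient descent on the summed cost, using central finite differences."""
    poses = {name: list(p) for name, p in poses.items()}
    for _ in range(steps):
        grads = {}
        for name in poses:
            g = []
            for axis in (0, 1):
                original = poses[name][axis]
                poses[name][axis] = original + eps
                up = total_cost(poses, relations)
                poses[name][axis] = original - eps
                down = total_cost(poses, relations)
                poses[name][axis] = original
                g.append((up - down) / (2 * eps))
            grads[name] = g
        for name, g in grads.items():
            poses[name][0] -= lr * g[0]
            poses[name][1] -= lr * g[1]
    return {name: tuple(p) for name, p in poses.items()}


# Stage 1 (assumed output of an LLM): abstract relations for one place setting.
relations = [
    Relation("left_of", "fork", "plate"),
    Relation("right_of", "knife", "plate"),
]

# Stage 2: ground the relations into 2D positions from a deliberately bad start.
start = {"fork": (0.3, 0.2), "plate": (0.0, 0.0), "knife": (-0.3, -0.1)}
layout = ground(start, relations)
```

Summing one cost per graph edge mirrors the compositional idea in the abstract: adding a new relation type only requires adding one entry to `COSTS`, and the same grounding routine then handles any graph built from the known relations.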