AI Summary
This work addresses the challenge of automatically generating physically realizable 3D block structures from a single, informal 2D front-view sketch, without requiring domain expertise or precise geometric input. The proposed method constructs an abstract relation graph that explicitly encodes symbolic geometric constraints and stability patterns, and employs compositional diffusion models to ground the graph in 3D poses while iteratively inferring support structures that the sketch does not show. Unlike existing approaches that rely on accurate 3D annotations or professional CAD tools, the framework accepts crude, child-level hand-drawn sketches and outputs multilevel, mechanically stable 3D structures (e.g., landmarks and residential buildings). Quantitative and qualitative evaluations show improvements over baseline methods in both structural stability and sketch-to-3D visual fidelity.
Abstract
Imagine a child sketching the Eiffel Tower and asking a robot to bring it to life. Today's robot manipulation systems can't act on such sketches directly: they require precise 3D block poses as goals, which in turn demand structural analysis and expert tools like CAD. We present StackItUp, a system that enables non-experts to specify complex 3D structures using only 2D front-view hand-drawn sketches. StackItUp introduces an abstract relation graph to bridge the gap between rough sketches and accurate 3D block arrangements, capturing the symbolic geometric relations (e.g., left-of) and stability patterns (e.g., two-pillar-bridge) while discarding noisy metric details from sketches. It then grounds this graph to 3D poses using compositional diffusion models and iteratively updates it by predicting hidden internal and rear supports, which are critical for stability but absent from the sketch. Evaluated on sketches of iconic landmarks and modern house designs, StackItUp consistently produces stable, multilevel 3D structures and outperforms all baselines in both stability and visual resemblance.
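To make the idea of an abstract relation graph more concrete, the following is a minimal illustrative sketch and not the StackItUp implementation. It assumes a hypothetical input format in which each sketched block is an axis-aligned rectangle in the front view. The `Block` class, the `supports` relation, the tolerance `tol`, and all function names are assumptions introduced here; only the `left-of` relation and the `two-pillar-bridge` pattern are named in the abstract. The diffusion-based grounding and hidden-support prediction are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Axis-aligned rectangle from a front-view sketch (hypothetical format)."""
    name: str
    x: float  # left edge
    y: float  # bottom edge
    w: float  # width
    h: float  # height

    @property
    def right(self) -> float:
        return self.x + self.w

    @property
    def top(self) -> float:
        return self.y + self.h


def left_of(a: Block, b: Block) -> bool:
    """a lies entirely to the left of b."""
    return a.right <= b.x


def supports(a: Block, b: Block, tol: float) -> bool:
    """b rests on a: b's bottom is near a's top and they overlap horizontally."""
    touching = abs(b.y - a.top) <= tol
    overlap = min(a.right, b.right) - max(a.x, b.x)
    return touching and overlap > 0


def build_relation_graph(blocks, tol=0.2):
    """Return symbolic edges only; metric coordinates are discarded."""
    edges = set()
    for a in blocks:
        for b in blocks:
            if a is b:
                continue
            if left_of(a, b):
                edges.add((a.name, "left-of", b.name))
            if supports(a, b, tol):
                edges.add((a.name, "supports", b.name))
    return edges


def find_two_pillar_bridges(blocks, edges):
    """Find (left pillar, right pillar, top) triples in the symbolic graph."""
    patterns = []
    for top in blocks:
        pillars = sorted(a for (a, rel, b) in edges
                         if rel == "supports" and b == top.name)
        for i, p in enumerate(pillars):
            for q in pillars[i + 1:]:
                if (p, "left-of", q) in edges:
                    patterns.append((p, q, top.name))
                elif (q, "left-of", p) in edges:
                    patterns.append((q, p, top.name))
    return patterns


# A rough, slightly misaligned sketch of an arch: two pillars and a beam.
sketch = [
    Block("left_pillar", 0.0, 0.0, 1.0, 2.0),
    Block("right_pillar", 3.1, 0.05, 1.0, 2.0),
    Block("beam", -0.1, 2.1, 4.3, 0.5),
]
graph = build_relation_graph(sketch)
bridges = find_two_pillar_bridges(sketch, graph)
```

The two pillars in this example have different heights and the beam floats slightly above them, yet the resulting graph contains only symbolic edges. This is the sense in which a relation graph can tolerate noisy metric details: a later grounding stage is free to choose exact poses that satisfy the relations.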