🤖 AI Summary
This study examines whether multimodal large language models (MLLMs) have the visual grounding and spatial reasoning needed for brick assembly, formulating the task as a sequential decision-making process in which each step involves brick selection and brick pose estimation. The authors introduce BC-Bench (Brick Construction Benchmark), the first benchmark for evaluating MLLMs on assembly with diverse bricks, and find that current state-of-the-art MLLMs are far from reliable builders. To close the gap, they propose Brick-Composer, a learning framework that combines three signals: human construction demonstrations (Human Design Sparks), visual and physical feedback on predicted actions (World Feedback), and synthetic data (Synthetic Experience). Brick-Composer more than triples brick selection accuracy, substantially reduces pose estimation error, and raises the strict step-level assembly success rate from below 1% to around 15%. After training, a Qwen-3-8B model correctly composes up to 42% of the steps for a complete object.
📝 Abstract
We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate brick assembly as a sequential decision-making problem in which each step involves two subtasks: brick selection, which identifies the target brick among candidate components, and brick pose estimation, which predicts where and how the selected brick should be placed. To support this study, we introduce BC-Bench (Brick Construction Benchmark), the first benchmark for evaluating MLLMs on assembly with diverse bricks. Experiments show that current state-of-the-art MLLMs remain far from reliable builders: they struggle with fine-grained brick selection and fail at precise pose estimation. To bridge this gap, we propose Brick-Composer, a learning framework that equips MLLMs with assembly skills through three complementary signals: Human Design Sparks, which provide affordance-rich construction demonstrations; World Feedback, which grounds predicted actions in their visual and physical consequences; and Synthetic Experience, which scales learning beyond existing object designs. Brick-Composer more than triples brick selection accuracy, substantially reduces pose estimation errors, and raises strict step-level assembly success from less than 1% to around 15%. After training, a Qwen-3-8B model can correctly compose up to 42% of the steps for a complete object, suggesting that MLLMs can acquire assembly capabilities through targeted, physically grounded learning.
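The two-subtask formulation and the step-level metrics can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than the BC-Bench definition: the `Pose` and `Step` types, the integer grid coordinates with a yaw angle, the exact-match pose criterion, and the reading of "strict" success as requiring both the correct brick and the correct pose in the same step.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Pose:
    """Hypothetical pose: integer grid position plus yaw in degrees."""
    x: int
    y: int
    z: int
    yaw_deg: int = 0


@dataclass(frozen=True)
class Step:
    """One assembly step: which brick is chosen and where it is placed."""
    brick_id: str
    pose: Pose


def selection_correct(pred: Step, target: Step) -> bool:
    # Subtask 1: the predicted brick must match the target brick.
    return pred.brick_id == target.brick_id


def pose_correct(pred: Step, target: Step) -> bool:
    # Subtask 2 (assumed criterion): exact position match, yaw compared modulo 360.
    p, t = pred.pose, target.pose
    same_position = (p.x, p.y, p.z) == (t.x, t.y, t.z)
    same_yaw = p.yaw_deg % 360 == t.yaw_deg % 360
    return same_position and same_yaw


def strict_step_success(pred: Step, target: Step) -> bool:
    # Assumed meaning of "strict": both subtasks are right in the same step.
    return selection_correct(pred, target) and pose_correct(pred, target)


def evaluate(preds: Sequence[Step], targets: Sequence[Step]) -> Dict[str, float]:
    """Return per-step selection accuracy and strict step-level success rate."""
    if not targets or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and of equal length")
    n = len(targets)
    selection_hits = sum(selection_correct(p, t) for p, t in zip(preds, targets))
    strict_hits = sum(strict_step_success(p, t) for p, t in zip(preds, targets))
    return {
        "selection_accuracy": selection_hits / n,
        "strict_step_success": strict_hits / n,
    }
```

Calling `evaluate(preds, targets)` on two aligned step sequences returns both rates. Under this assumed definition the strict rate can never exceed selection accuracy, since a strictly successful step must also select the right brick.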