🤖 AI Summary
This work addresses the challenge of autonomous 3D shaping of deformable objects (e.g., clay) from natural language instructions without explicit 3D target supervision. We propose the first end-to-end, real-world text-to-3D clay sculpting system, which operates in two stages: coarse-grained block placement and fine-grained local deformation. Our contributions are threefold: (1) the first use of large language models (LLMs) to hierarchically decompose high-level textual commands into semantically grounded sub-goals for deformable object manipulation; (2) a novel point-cloud region-driven action prediction paradigm integrating a region encoder, an action prediction network, and a multimodal similarity evaluation framework; and (3) a dedicated quantitative evaluation protocol tailored to text-to-3D shaping tasks. Experiments on a physical robot platform demonstrate that our system reliably generates diverse 3D clay sculptures from natural language inputs. Quantitative and human evaluations confirm that our metrics align better with human perception than conventional text-to-image or text-to-point-cloud benchmarks.
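To make the two-stage design concrete, below is a minimal, hypothetical Python sketch of the control loop implied by this summary. Every function in it is a placeholder assumption for illustration only; the paper's actual components, prompts, and robot interfaces are not shown here.

```python
import numpy as np

# Hypothetical stand-ins for the paper's components; none of these names
# come from the released system.
def llm_decompose(prompt):                        # LLM -> semantic sub-goals
    return ["base", "body", "top"]                # placeholder decomposition

def plan_block_placements(subgoals):              # coarse stage: block layout
    return [np.array([0.0, 0.0, 0.02 * i]) for i in range(len(subgoals))]

def place_block(position):                        # robot pick-and-place
    pass

def observe_point_cloud():                        # depth-camera observation
    return np.random.rand(1024, 3)

def region_encode(cloud):                         # per-region latent features
    return cloud.mean(axis=0)

def predict_action(features, goal):               # action from features + goal
    return features

def execute(action):                              # robot deformation action
    pass

def similarity(goal, cloud):                      # text-shape similarity score
    return 0.0

def sculpt_from_text(prompt, max_steps=20, stop_threshold=0.9):
    """Two-stage coarse-to-fine loop sketched from the summary above."""
    subgoals = llm_decompose(prompt)              # hierarchical decomposition
    for placement in plan_block_placements(subgoals):
        place_block(placement)                    # coarse: place clay chunks
    for goal in subgoals:                         # fine: refine per sub-goal
        for _ in range(max_steps):
            features = region_encode(observe_point_cloud())
            execute(predict_action(features, goal))
            if similarity(goal, observe_point_cloud()) > stop_threshold:
                break
    return observe_point_cloud()

final_cloud = sculpt_from_text("a small snowman")
```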
📝 Abstract
Deformable object manipulation remains a key challenge in developing autonomous robotic systems that can be successfully deployed in real-world scenarios. In this work, we explore the challenges of deformable object manipulation through the task of sculpting clay into 3D shapes. We propose the first coarse-to-fine autonomous sculpting system, in which the sculpting agent first selects how many discrete chunks of clay to place in the workspace and where to place them, creating a coarse shape, and then iteratively refines the shape with sequences of deformation actions. We leverage large language models for sub-goal generation and train a point-cloud region-based action model to predict robot actions from the desired point-cloud sub-goals. Additionally, ours is the first autonomous sculpting system to realize a real-world text-to-3D shaping pipeline without any explicit 3D goals or sub-goals provided to the system. We demonstrate that our method successfully creates a set of simple shapes solely from text-based prompting. Furthermore, we rigorously explore how best to quantify success for the text-to-3D sculpting task, comparing existing text-to-image and text-to-point-cloud similarity metrics against human evaluations. For experimental videos, human evaluation details, and full prompts, please see our project website: https://sites.google.com/andrew.cmu.edu/hierarchicalsculpting
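As an illustration of the metric comparison, here is a minimal sketch of rank-correlating one conventional text-to-image similarity metric (CLIP, via Hugging Face transformers and SciPy) with human ratings. The checkpoint choice, the point-cloud rendering step, and all variable names are assumptions for illustration, not the paper's exact evaluation protocol.

```python
import torch
from scipy.stats import spearmanr
from transformers import CLIPModel, CLIPProcessor

# Checkpoint choice is an assumption; any CLIP variant would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_image_score(prompt, image):
    """CLIP similarity between a prompt and one rendered sculpture image."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.item()            # higher = more similar

def metric_vs_human(prompts, renders, human_scores):
    """Rank-correlate a similarity metric with mean human ratings.

    prompts: list[str]; renders: list of PIL.Image renders of the final
    sculptures (rendering is assumed, not shown); human_scores: list[float].
    """
    metric_scores = [clip_text_image_score(p, img)
                     for p, img in zip(prompts, renders)]
    rho, pvalue = spearmanr(metric_scores, human_scores)
    return rho, pvalue
```

A metric well suited to this task should yield a high Spearman ρ against human judgments across a set of (prompt, sculpture) pairs; the same harness can score text-to-point-cloud metrics by swapping out `clip_text_image_score`.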