Planning and Reasoning with 3D Deformable Objects for Hierarchical Text-to-3D Robotic Shaping

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of autonomous 3D shaping of deformable objects (e.g., clay) from natural language instructions without explicit 3D target supervision. We propose the first end-to-end real-world text-to-3D clay sculpting system, operating in two stages: coarse-grained block placement and fine-grained local deformation. Our contributions are threefold: (1) the first use of large language models (LLMs) to hierarchically decompose high-level textual commands into semantically grounded sub-goals for deformable object manipulation; (2) a novel point-cloud region-driven action prediction paradigm integrating a region encoder, action prediction network, and multimodal similarity evaluation framework; and (3) a dedicated quantitative evaluation protocol tailored to text-to-3D shaping tasks. Experiments on a physical robot platform demonstrate that our system reliably generates diverse 3D clay sculptures from natural language inputs. Quantitative and human evaluations confirm that our metrics align better with perceptual consistency than conventional text-to-image or text-to-point-cloud benchmarks.

📝 Abstract
Deformable object manipulation remains a key challenge in developing autonomous robotic systems that can be successfully deployed in real-world scenarios. In this work, we explore the challenges of deformable object manipulation through the task of sculpting clay into 3D shapes. We propose the first coarse-to-fine autonomous sculpting system, in which the sculpting agent first selects how many discrete chunks of clay to place in the workspace, and where, to create a coarse shape, and then iteratively refines the shape with sequences of deformation actions. We leverage large language models for sub-goal generation, and train a point cloud region-based action model to predict robot actions from the desired point cloud sub-goals. Additionally, our method is the first autonomous sculpting system that realizes a real-world text-to-3D shaping pipeline without any explicit 3D goals or sub-goals provided to the system. We demonstrate that our method is able to successfully create a set of simple shapes solely from text-based prompting. Furthermore, we rigorously explore how best to quantify success for the text-to-3D sculpting task, and compare existing text-to-image and text-to-point-cloud similarity metrics to human evaluations for this task. For experimental videos, human evaluation details, and full prompts, please see our project website: https://sites.google.com/andrew.cmu.edu/hierarchicalsculpting
Problem

Research questions and friction points this paper is trying to address.

Develop an autonomous robotic system for deformable object manipulation.
Create a coarse-to-fine sculpting system driven by a text-to-3D shaping pipeline.
Quantify success in text-to-3D sculpting with appropriate similarity metrics.
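The last point (quantifying text-to-3D success) ultimately reduces to comparing a text embedding against an embedding of the sculpted shape. A minimal sketch of such a similarity metric, using cosine similarity over hypothetical embedding vectors (the actual text and point-cloud encoders used in the paper are not reproduced here; the values below are made up purely to exercise the metric):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: in the paper's setting these would come from a
# text encoder and a point-cloud (or rendered-view) encoder.
text_emb = np.array([0.2, 0.9, 0.1])
shape_emb = np.array([0.25, 0.85, 0.05])

score = cosine_similarity(text_emb, shape_emb)  # in [-1, 1]; higher = closer match
```

The paper's evaluation protocol compares metrics of this family (text-to-image and text-to-point-cloud) against human judgments of perceptual consistency.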
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine autonomous sculpting system
Large language models for sub-goal generation
Text-to-3D shaping without explicit 3D goals
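The coarse-to-fine loop these contributions describe can be sketched at a high level. All function names below are hypothetical stand-ins for the paper's components (the LLM sub-goal generator, the point cloud region-based action model, and the robot interface), not the actual implementation:

```python
def sculpt_from_text(prompt, llm_decompose, predict_action, execute, observe,
                     max_steps=20):
    """Hypothetical coarse-to-fine sculpting loop: an LLM splits the prompt
    into ordered sub-goals (coarse placements first, then refinements), and an
    action model is queried until each sub-goal is satisfied.
    Returns the list of executed (sub-goal, action) pairs."""
    executed = []
    for goal in llm_decompose(prompt):
        for _ in range(max_steps):
            cloud = observe()                      # current clay point cloud
            action, done = predict_action(cloud, goal)
            if done:                               # sub-goal reached
                break
            execute(action)
            executed.append((goal, action))
    return executed

# Toy stubs standing in for the real LLM, perception, and robot interface.
_state = {"n": 0}

def toy_decompose(prompt):
    return ["place coarse blocks", "refine surface"]

def toy_observe():
    return _state["n"]                             # stand-in for a point cloud

def toy_predict(cloud, goal):
    return ("deform", cloud >= 2)                  # "done" after two actions

def toy_execute(action):
    _state["n"] += 1

log = sculpt_from_text("sculpt a small pyramid", toy_decompose, toy_predict,
                       toy_execute, toy_observe)
```

The key design point is that no explicit 3D goal is given: the text prompt alone drives sub-goal generation, and each sub-goal grounds the action model's predictions.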
Alison Bartsch
PhD from Carnegie Mellon University
Robotics · Manipulation · Deep Learning · Computer Vision
A. Farimani
Department of Mechanical Engineering, Carnegie Mellon University