🤖 AI Summary
Current vision-language models struggle to recover the minimal-success action chain, the fewest actions that complete a task, from exploratory manipulation video and proprioception that include failed attempts, particularly when those failures reveal latent preconditions. To address this, the work introduces the Exploratory Manipulation Trace QA (EMT-QA) task and proposes Closed-Loop Trace Distillation, a pipeline in which a per-task coding agent inspects labeled training traces and compresses the task-specific reading logic into a one-line natural-language prompt, the Distilled Reading Heuristic (DRH). At inference, the DRH guides a frozen vision-language model to read the raw trace, with no weight updates and no agent calls. Across three simulator and two real-robot tasks, the method improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The same DRH also serves as the sole specification for one-shot programmatic classifiers that match the prompted VLM.
📝 Abstract
Exploratory manipulation often turns an apparent failed attempt into the key evidence for what to do next. For example, a robot pulls a locked cabinet drawer, fails, and only succeeds after opening the lock. The failed pull reveals a latent precondition (the drawer is locked) that determines the minimal-success action chain (the fewest actions that complete the task), here [lock-open, drawer-pull]. Correctly reading this trace is therefore the prerequisite for recovering that chain. We formalize this setting as Exploratory Manipulation Trace QA (EMT-QA): given synchronized video and proprioception from an exploratory trace, predict the minimal-success action chain under the latent precondition revealed by the probe. However, even state-of-the-art VLMs and embodied multimodal LLMs misread this evidence: they do not reliably recover the chain from raw video, raw proprioception, or their combination.
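To make the task format concrete, the following is a minimal Python sketch of what one EMT-QA sample and its scoring might look like. The schema (`EMTQASample`, its field names) and the exact-match definition of chain accuracy are illustrative assumptions, not details given in the abstract.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EMTQASample:
    """One exploratory trace with its label (hypothetical schema)."""
    video_frames: List[str]            # placeholder: paths or IDs of video frames
    proprioception: List[List[float]]  # placeholder: one feature vector per timestep
    chain: List[str]                   # ground-truth minimal-success action chain


def chain_accuracy(predicted: Sequence[Sequence[str]],
                   gold: Sequence[Sequence[str]]) -> float:
    """Fraction of traces whose predicted chain equals the gold chain exactly.

    Exact-match scoring is an assumption made for this sketch.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(list(p) == list(g) for p, g in zip(predicted, gold))
    return hits / len(gold)


# The locked-drawer example: the failed pull implies the lock must be opened first.
sample = EMTQASample(
    video_frames=["frame_000.png"],
    proprioception=[[0.0, 0.0]],
    chain=["lock-open", "drawer-pull"],
)
```

Under this scoring, a model that answers `["drawer-pull"]` for the locked-drawer trace gets no credit, because it ignored the precondition revealed by the failed pull.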
We introduce Closed-Loop Trace Distillation, a pipeline that uses a per-task coding agent to inspect labeled training traces and distill a one-line natural-language prompt over the trace, which we call the Distilled Reading Heuristic (DRH). At inference, no agent is invoked and no model weights are updated; a frozen VLM receives the raw trace plus the DRH as a prompt entry. Across three simulator and two real-robot tasks, the DRH improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The same DRH also serves as the sole specification for one-shot programmatic classifiers that match the prompted VLM.
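As a rough illustration of how a one-line heuristic could serve both as a prompt addition and as a classifier specification, the sketch below uses the locked-drawer example. The DRH wording, the signal names (`pull_force`, `drawer_disp`), and the thresholds are all hypothetical; the abstract does not give the actual heuristics or classifiers.

```python
from typing import List, Sequence

# Hypothetical DRH for the locked-drawer task (not taken from the paper).
DRH = ("If pull force rises while the drawer barely moves, the drawer is locked: "
       "answer [lock-open, drawer-pull]; otherwise answer [drawer-pull].")


def build_prompt(question: str, drh: str) -> str:
    """Append the DRH to the question text sent to a frozen VLM with the raw trace."""
    return f"{question}\nReading heuristic: {drh}"


def classify_chain(pull_force: Sequence[float],
                   drawer_disp: Sequence[float],
                   force_thresh: float = 5.0,   # assumed threshold
                   disp_thresh: float = 0.01) -> List[str]:  # assumed threshold
    """Programmatic reading of the hypothetical DRH above.

    A timestep counts as a stalled pull when force is high while the
    drawer displacement stays near zero.
    """
    if len(pull_force) != len(drawer_disp):
        raise ValueError("signals must have the same length")
    stalled = any(f > force_thresh and abs(d) < disp_thresh
                  for f, d in zip(pull_force, drawer_disp))
    return ["lock-open", "drawer-pull"] if stalled else ["drawer-pull"]


# Locked case: force spikes at steps 1-2 while the drawer does not move.
locked = classify_chain([2.0, 8.0, 9.0, 1.0, 3.0], [0.0, 0.001, 0.002, 0.0, 0.1])
# Unlocked case: the drawer slides out under low force.
unlocked = classify_chain([1.0, 2.0, 3.0], [0.0, 0.05, 0.1])
```

The same sentence drives both paths: `build_prompt` passes it verbatim to the model, while `classify_chain` hard-codes one possible interpretation of it over proprioceptive signals.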