🤖 AI Summary
This work proposes a scalable framework for eliciting task-solving ability in pretrained video generators without paired task-video supervision. Such models rely on detailed textual descriptions, which limits their use for high-level planning and decision-making, and existing remedies either outsource the reasoning to language or vision-language models or require supervised fine-tuning on task-execution videos that are costly to collect. The proposed method first uses a vision-language model (VLM) to generate a candidate task and a step-by-step solution from an unlabeled scene image; the solution then conditions a pretrained video diffusion model, the Demonstrator, to synthesize a demonstration video. Through self-distillation, the Demonstrator's behavior is transferred to an Executor that is conditioned only on the image and a short task prompt, and the Executor is further improved with reinforcement learning from VLM feedback. The approach transfers execution knowledge from caption-guided video generation to instruction-conditioned task solving: on the proposed WorldTasks-Benchmark and the DreamGen robotics benchmark, the Executor surpasses the Demonstrator under the authors' VLM-based evaluation protocol and transfers competitively to robotic tasks.
📝 Abstract
Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.
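The data flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: every name here (`DistillationExample`, `build_distillation_set`, `collect_rewards`, and the callables passed in) is hypothetical, images and videos are stood in for by strings, and the actual distillation loss and reinforcement learning update are not specified in the abstract, so they are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Placeholder types (assumption): real code would use image/video tensors.
Image = str
Video = str


@dataclass
class DistillationExample:
    image: Image
    task_prompt: str  # short instruction; the only text the Executor sees
    video: Video      # Demonstrator output, used as the distillation target


def build_distillation_set(
    images: List[Image],
    propose_task: Callable[[Image], Tuple[str, str]],  # VLM: image -> (task, detailed solution)
    demonstrator: Callable[[Image, str], Video],       # video model conditioned on the solution
) -> List[DistillationExample]:
    """Stage 1: build (image, short task, video) triples from unlabeled images."""
    examples = []
    for image in images:
        task, solution = propose_task(image)
        video = demonstrator(image, solution)
        # The detailed solution is deliberately dropped here: the Executor is
        # trained to reproduce the video from the image and short task alone.
        examples.append(DistillationExample(image, task, video))
    return examples


def collect_rewards(
    examples: List[DistillationExample],
    executor: Callable[[Image, str, int], Video],  # (image, task, seed) -> sampled video
    judge: Callable[[Image, str, Video], float],   # VLM score: does the video solve the task?
    samples_per_task: int = 4,
) -> List[List[float]]:
    """Stage 2: sample Executor videos and score each one with a VLM judge."""
    rewards = []
    for ex in examples:
        group = []
        for seed in range(samples_per_task):
            video = executor(ex.image, ex.task_prompt, seed)
            group.append(judge(ex.image, ex.task_prompt, video))
        rewards.append(group)
    return rewards
```

The split into two functions mirrors the asymmetry the abstract mentions: the judge only has to verify a sampled video against the task, while the detailed solution is needed only once, to drive the Demonstrator. The per-task reward groups returned by `collect_rewards` would feed whichever policy-optimization method is used, which this sketch does not assume.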