World Model Self-Distillation: Training World Models to Solve General Tasks

📅 2026-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a scalable framework for task-conditioned video generation that needs no paired task-video supervision. Pretrained video generation models depend on detailed textual descriptions, which limits their direct use for high-level planning and decision-making, and existing approaches either outsource the reasoning to language or vision-language models or require costly paired task-execution videos. The method first uses a vision-language model (VLM) to generate a candidate task and a step-by-step solution from an unlabeled scene image; the solution then conditions a pretrained video diffusion model (the Demonstrator) to synthesize a demonstration video. Through self-distillation, the Demonstrator's behavior is transferred to an Executor that is conditioned only on the image and a short task prompt, and the Executor is further refined with reinforcement learning from VLM feedback. This transfers execution knowledge from caption-guided video generation to instruction-conditioned task solving. On the proposed WorldTasks-Benchmark and the DreamGen robotics benchmark, the Executor surpasses the Demonstrator under the authors' VLM-based evaluation protocol and transfers competitively to robotic tasks.
📝 Abstract
Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.
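The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`DistillExample`, `build_distillation_set`, `vlm_rewards`, and the `toy_*` stand-ins) is hypothetical, images and videos are replaced by strings, and the binary reward is an assumption, since the abstract does not specify the reward form or the policy-update rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DistillExample:
    """One training pair for the Executor (all fields are placeholders)."""
    image: str         # stands in for an unlabeled scene image
    task_prompt: str   # short instruction the Executor is conditioned on
    video: List[str]   # stands in for frames sampled from the Demonstrator


def build_distillation_set(
    images: List[str],
    propose_task: Callable[[str], Tuple[str, str]],
    demonstrator: Callable[[str, str], List[str]],
) -> List[DistillExample]:
    """Pair each image with a short task and a Demonstrator video."""
    examples = []
    for image in images:
        # The VLM proposes a task and a detailed step-by-step solution.
        task, solution = propose_task(image)
        # Only the Demonstrator sees the detailed solution.
        video = demonstrator(image, solution)
        # The Executor target keeps the short task and drops the solution.
        examples.append(DistillExample(image, task, video))
    return examples


def vlm_rewards(
    samples: List[List[str]],
    task: str,
    judge: Callable[[List[str], str], bool],
) -> List[float]:
    """Score sampled Executor videos with a VLM judge (binary reward assumed)."""
    return [1.0 if judge(video, task) else 0.0 for video in samples]


# Toy stand-ins so the sketch runs without any model.
def toy_propose_task(image: str) -> Tuple[str, str]:
    task = f"tidy the {image}"
    solution = f"step 1: find clutter in the {image}; step 2: move it to the shelf"
    return task, solution


def toy_demonstrator(image: str, solution: str) -> List[str]:
    return [f"{image}|frame{i}" for i in range(3)]


def toy_judge(video: List[str], task: str) -> bool:
    return len(video) >= 3


if __name__ == "__main__":
    data = build_distillation_set(["kitchen", "desk"], toy_propose_task, toy_demonstrator)
    print(data[0].task_prompt, len(data[0].video))
    print(vlm_rewards([["a", "b", "c"], ["a"]], "tidy the kitchen", toy_judge))
```

The sketch is meant to show the conditioning asymmetry the abstract relies on: the detailed solution reaches only the Demonstrator, while the stored Executor example carries just the image and the short task prompt. The judge callable likewise only has to decide whether a sampled video satisfies the task, which the abstract argues is easier than generating the solution.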
Problem

Research questions and friction points this paper is trying to address.

world models
video generation
task-solving
self-distillation
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

world model
self-distillation
reinforcement learning
video diffusion model
vision-language model