$τ_0$-WM: A Unified Video-Action World Model for Robotic Manipulation

📅 2026-05-31
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing robotic manipulation approaches struggle to jointly model visual prediction, action generation, and task evaluation within a single framework. This work proposes $τ_0$-WM, a unified video-action world model that integrates all three through a shared video diffusion backbone. The model exposes two interfaces: a video action model that jointly predicts future visual latents and continuous action chunks from multi-view images, language instructions, and robot state, and an action-conditioned video simulator that rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. Training on heterogeneous data uses modality-specific supervision masks. At inference time, the model samples action candidates, ranks them by re-denoising consistency, and applies simulator-based rectification to low-quality candidates. On long-horizon and fine-grained manipulation tasks, the approach is reported to outperform the relevant baselines.
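One way to picture modality-specific supervision masks is as per-source switches on the individual loss terms, so that a data source without a given label simply contributes nothing to that term. The sketch below is illustrative only: the source names, the three loss terms, and which terms each source supervises are assumptions made for the example, not details taken from the paper.

```python
# Hypothetical per-source supervision masks (assumed for illustration):
# 1.0 means the loss term is supervised for that source, 0.0 means it is skipped.
SUPERVISION_MASKS = {
    "teleoperation": {"video": 1.0, "action": 1.0, "progress": 1.0},
    "human_video": {"video": 1.0, "action": 0.0, "progress": 0.0},
}


def masked_loss(source, losses):
    """Sum only the loss terms that the mask enables for this data source."""
    mask = SUPERVISION_MASKS[source]
    return sum(mask[name] * value for name, value in losses.items())


# Example per-sample loss values (made up for the demonstration).
example_losses = {"video": 0.5, "action": 2.0, "progress": 1.0}
teleop_total = masked_loss("teleoperation", example_losses)
human_total = masked_loss("human_video", example_losses)
```

With these assumed masks, a human-video sample is trained only through the video term, while a teleoperation sample contributes to all three.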
📝 Abstract
Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. We present $τ_0$-World Model ($τ_0$-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, $τ_0$-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained on approximately $27{,}300$ hours of real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories using modality-specific supervision masks. At inference time, $τ_0$-WM uses test-time computation to sample action candidates, rank them with re-denoising consistency, and invoke simulator-based rectification for low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, $τ_0$-WM shows superior performance over other relevant baselines.
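The inference-time procedure described in the abstract (sample candidates, rank them, rectify weak ones) can be sketched as a small selection loop. This is a minimal sketch under stated assumptions: `select_action_chunk`, the `threshold` parameter, and the choice to rectify only the top-ranked candidate are hypothetical, and the re-denoising consistency score and the simulator-based rectifier are passed in as black-box callables because the abstract does not define them.

```python
def select_action_chunk(candidates, consistency_score, rectify, threshold):
    """Pick the highest-scoring candidate; rectify it if its score is too low.

    candidates: list of sampled action chunks (any objects).
    consistency_score: callable returning a float, higher is better
        (stands in for the re-denoising consistency score).
    rectify: callable returning a corrected action chunk
        (stands in for simulator-based rectification).
    threshold: assumed quality cutoff below which rectification is invoked.
    """
    ranked = sorted(candidates, key=consistency_score, reverse=True)
    best = ranked[0]
    if consistency_score(best) < threshold:
        best = rectify(best)
    return best


# Toy stand-ins: action chunks are scalars and the score peaks at 0.25.
toy_score = lambda a: 1.0 - abs(a - 0.25)
toy_rectify = lambda a: -1.0

accepted = select_action_chunk([0.0, 0.25, 1.0], toy_score, toy_rectify, threshold=0.5)
rectified = select_action_chunk([0.0, 0.25, 1.0], toy_score, toy_rectify, threshold=1.5)
```

In the first call the best candidate clears the cutoff and is returned unchanged; in the second the cutoff is set above every score, so the rectifier's output is returned instead.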
Problem

Research questions and friction points this paper is trying to address.

robotic manipulation
action prediction
future consequence evaluation
world model
video-action modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

world model
video diffusion
action evaluation
robotic manipulation
test-time computation
Pengfei Zhou
AGIBOT Finch
Shengcong Chen
Unknown affiliation
World Model, Computer Vision, Embodied AI, Medical Image Analysis
Di Chen
AGIBOT Finch
Jiaxu Wang
AGIBOT Finch
Rongjun Jin
AGIBOT Finch
Bingwen Zhu
Shanghai Innovation Institute, AGIBOT Finch
Yike Pan
University of Michigan
Embodied AI
Songen Gu
UCAS
Robotics, 3D Vision
Kuanning Wang
AGIBOT Finch
Shufeng Nan
AGIBOT Finch
Xingyu Qiu
Harbin Institute of Technology
Medical Image Analysis, Generative AI
Chenhao Qiu
AGIBOT Finch
Pu Yang
AGIBOT Finch
Yunuo Cai
Shanghai Innovation Institute, AGIBOT Finch
Jianxiong Gao
AGIBOT Finch
Yifan Li
Shanghai Innovation Institute
Yanwei Fu
Fudan University
Computer Vision, Machine Learning, Multimedia
Xiangyu Yue
The Chinese University of Hong Kong / UC Berkeley / Stanford University / NJU
Artificial Intelligence, Computer Vision, Multi-modal Learning
Zhi Chen
AGIBOT Finch
Jianlan Luo
UC Berkeley, Google X
Robotics, Machine Learning, Artificial Intelligence