OSCAR: Omni-Embodiment Skeleton-Conditioned World Action Model for Robotics

📅 2026-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets three limitations of existing robotic video world models: limited scenario diversity in robot training data, imprecise action following, and poor generalization across embodiments. It proposes 2D kinematic skeleton rendering as a unified conditioning representation that applies to different robot arms and even human hands. The authors build a large-scale, standardized data pipeline that curates, filters, and deduplicates robotics and egocentric human datasets into a joint training set. They then fine-tune Cosmos-Predict2.5-2B on a single GH200 GPU. Compared with existing baselines, which are either much larger or need more GPUs, the resulting model improves action following, appearance quality, and motion consistency. When OSCAR is used to evaluate RoboArena policies, its virtual evaluation results correlate significantly with real-world evaluation, which points toward evaluating robot policies entirely in generated virtual worlds.
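To make the idea of skeleton conditioning concrete, here is a minimal sketch of how 3D joint positions could be turned into a 2D skeleton image. It is not the authors' implementation: the pinhole camera model, the function names `project_joints` and `render_skeleton`, the binary canvas, and the bone list format are all assumptions made for illustration.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]   # joint position in the camera frame (x, y, z)
Pixel = Tuple[int, int]               # image coordinates (u, v)


def project_joints(joints_cam: List[Point3], fx: float, fy: float,
                   cx: float, cy: float) -> List[Pixel]:
    """Project camera-frame joints to pixels with an assumed pinhole model."""
    pixels = []
    for x, y, z in joints_cam:
        if z <= 0:
            raise ValueError("joint must be in front of the camera (z > 0)")
        pixels.append((round(fx * x / z + cx), round(fy * y / z + cy)))
    return pixels


def render_skeleton(pixels: List[Pixel], bones: List[Tuple[int, int]],
                    width: int, height: int) -> List[List[int]]:
    """Draw each bone as a straight line on a binary canvas (hypothetical format)."""
    canvas = [[0] * width for _ in range(height)]
    for a, b in bones:
        (u0, v0), (u1, v1) = pixels[a], pixels[b]
        # Sample enough points along the bone to cover every pixel it crosses.
        steps = max(abs(u1 - u0), abs(v1 - v0), 1)
        for i in range(steps + 1):
            u = round(u0 + (u1 - u0) * i / steps)
            v = round(v0 + (v1 - v0) * i / steps)
            if 0 <= u < width and 0 <= v < height:
                canvas[v][u] = 1
    return canvas
```

Because the rendering depends only on joint positions and a bone list, the same routine would accept a robot arm's kinematic chain or a human hand's joints, which is the property the summary attributes to the skeleton representation.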
📝 Abstract
We present OSCAR, a precise action-conditioned video world model that generalizes across different robot embodiments and enables robot policy evaluation. Existing video world models face three main challenges for real-world robot evaluation: limited scenario diversity in current robot training datasets, imprecise action following, and poor generalization across embodiments for broad adoption. We tackle these challenges from two perspectives. At its core is a large-scale standardized data pipeline that curates, filters, and deduplicates broad robotics and egocentric human datasets, yielding a clean joint-training dataset that spans diverse tasks, scenarios, actions, and robot embodiments. To condition the video model, we adopt 2D kinematic skeleton rendering as a unified conditioning representation that generalizes across different robot arms or even human hands. We finetune the Cosmos-Predict2.5-2B model on a single GH200 GPU. Our model achieves significant improvement on action following, appearance quality, and motion consistency, compared to existing baselines, which either have a much larger model size or require more GPUs. We further deploy OSCAR to evaluate robot policies from RoboArena. Extensive experiments demonstrate the significant correlation between our virtual policy evaluation in OSCAR and real-world evaluation, paving the way for the future where robot policies can be purely evaluated in virtual generated worlds.
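The correlation claim at the end of the abstract can be illustrated with a small sketch. The abstract does not say which correlation statistic is reported, so the Pearson coefficient below is an assumption, and the function name `pearson` and the score lists are hypothetical placeholders, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-policy success rates, for illustration only.
virtual_scores = [0.20, 0.45, 0.60, 0.80]   # scored inside the world model
real_scores = [0.25, 0.40, 0.65, 0.75]      # scored on physical robots
r = pearson(virtual_scores, real_scores)
```

A value of `r` close to 1 would mean that policies ranked highly in the generated world also tend to rank highly on real hardware, which is the relationship the abstract reports for RoboArena policies.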
Problem

Research questions and friction points this paper is trying to address.

video world model
robot embodiment generalization
action-conditioned generation
robot policy evaluation
scenario diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

world model
embodiment generalization
skeleton conditioning
robot policy evaluation
action-conditioned video generation