Robots Need More than VLA and World Models

📅 2026-06-04

🤖 AI Summary

Current general-purpose robot learning leans heavily on policy scaling and struggles to exploit the large volume of unstructured data on human behavior and environment interaction. This position paper argues that four interfaces are missing: data interfaces for automated annotation, embodiment interfaces for transferring human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success. Drawing on recent work in robot foundation models, learning from video, and reward modeling, it proposes a research agenda that goes beyond relying solely on robot demonstrations or on scaled-up vision-language-action (VLA) models and world models. With an emphasis on cross-modal semantic alignment and cross-embodiment knowledge transfer, the paper outlines what robots would need in order to learn from open-world experience rather than from robot demonstrations alone.

📝 Abstract

Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. We identify four missing components for the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. We survey recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotics systems that can learn not only from robot demonstrations, but from the broader physical world.
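
The four interfaces named in the abstract can be read as stages of a pipeline that turns an unlabelled behaviour clip into robot supervision. The sketch below is only an illustration of that reading, not the authors' design: every class, method, and field name (`Clip`, `Supervision`, `autolabel`, `retarget`, `feasible`, `progress`, `to_supervision`) is hypothetical, and the toy implementations are trivial placeholders (a fixed label, a scalar rescaling, a joint-limit check, a frame-count ratio) standing in for learned models.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence

Vector = List[float]


@dataclass
class Clip:
    """An unlabelled behaviour clip: per-frame human poses and nothing else."""
    human_poses: List[Vector]


@dataclass
class Supervision:
    """Robot supervision derived from one clip (hypothetical schema)."""
    instruction: str
    robot_actions: List[Vector]
    feasible: bool
    progress: float  # estimated task progress in [0, 1]


class DataInterface(Protocol):
    def autolabel(self, clip: Clip) -> str: ...


class EmbodimentInterface(Protocol):
    def retarget(self, human_poses: Sequence[Vector]) -> List[Vector]: ...


class WorldModelInterface(Protocol):
    def feasible(self, robot_actions: Sequence[Vector]) -> bool: ...


class RewardInterface(Protocol):
    def progress(self, clip: Clip, instruction: str) -> float: ...


def to_supervision(clip: Clip, data: DataInterface, body: EmbodimentInterface,
                   world: WorldModelInterface, reward: RewardInterface) -> Supervision:
    """Chain the four interfaces: label, retarget, check physics, score."""
    instruction = data.autolabel(clip)
    actions = body.retarget(clip.human_poses)
    return Supervision(instruction, actions, world.feasible(actions),
                       reward.progress(clip, instruction))


# --- Toy placeholders (assumptions, not real models) ---

class ConstantLabeller:
    def autolabel(self, clip: Clip) -> str:
        return "pick up the cup"  # a real system would infer this from the clip


class ScaleRetargeter:
    def __init__(self, scale: float) -> None:
        self.scale = scale

    def retarget(self, human_poses: Sequence[Vector]) -> List[Vector]:
        # Rescale every pose coordinate; stands in for human-to-robot retargeting.
        return [[self.scale * x for x in pose] for pose in human_poses]


class JointLimitCheck:
    def __init__(self, limit: float) -> None:
        self.limit = limit

    def feasible(self, robot_actions: Sequence[Vector]) -> bool:
        # Accept only actions whose every coordinate is within +/- limit.
        return all(abs(x) <= self.limit for a in robot_actions for x in a)


class FrameCountProgress:
    def __init__(self, expected_frames: int) -> None:
        self.expected_frames = expected_frames

    def progress(self, clip: Clip, instruction: str) -> float:
        # Fraction of the expected clip length observed, capped at 1.0.
        return min(1.0, len(clip.human_poses) / self.expected_frames)
```

Because the stages are structural `Protocol` types, any one placeholder can be swapped for a learned component without touching `to_supervision`, which is the sense in which the abstract treats them as interfaces rather than as a single monolithic policy.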

Problem

Research questions and friction points this paper is trying to address.

robot learning

unstructured behavioral data

embodied supervision

generalist robot intelligence

reward structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action models

unstructured behavioral data

embodiment retargeting