Cosmos 3: Omnimodal World Models for Physical AI

📅 2026-06-01
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work proposes Cosmos 3, a family of unified omnimodal world models that jointly process and generate language, images, video, audio, and action sequences to support multimodal perception and decision-making for embodied agents in the physical world. Built on a mixture-of-transformers architecture, Cosmos 3 integrates vision-language understanding, video generation, world simulation, and action policy learning within a single framework. Scalability is achieved through large-scale synthetic data training and a unified input–output interface. The model achieves state-of-the-art performance across multiple multimodal understanding and generation benchmarks. At the time the technical report was written, its post-trained variants were ranked by Artificial Analysis as the best open-source text-to-image and image-to-video models and by RoboArena as the best policy model.
📝 Abstract
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 unifies critical modalities for Physical AI, effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation shows that Cosmos 3 establishes a new state of the art across a diverse suite of understanding and generation tasks, demonstrating that omnimodal world models are scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena, at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License (https://openmdw.ai/license/1-1/) at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3 . The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3 .
Problem

Research questions and friction points this paper is trying to address.

omnimodal
world models
Physical AI
multimodal integration
embodied agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

omnimodal
world models
mixture-of-transformers
Physical AI
embodied agents
🔎 Similar Papers
Aditi
Niket Agarwal
Arslan Ali
Senior AI Applied Research Scientist, NVIDIA
AI, Deep Learning, Generative AI
Jon Allen
Martin Antolini
Adeline Aubame
Alisson Azzolini
Junjie Bai
Maciej Bala
Yogesh Balaji
Research Scientist at NVIDIA
Machine Learning, Computer Vision, Artificial Intelligence
Josh Bapst
Aarti Basant
Mukesh Beladiya
Mohammad Qazim Bhat
Zaid Pervaiz Bhat
Dan Blick
Vanni Brighella
Han Cai
Tiffany Cai
Eric Cameracci
Jiaxin Cao
Yulong Cao
Research Scientist, NVIDIA Research; Ph.D. Umich
Trustworthy AI, System Security, CPS Security
Mark Carlson
Board of Governors of the Federal Reserve System
economics, economic history, monetary economics
Carlos Casanova
Ting-Yun Chang