🤖 AI Summary
This work addresses the limited scalability of native unified multimodal models for cross-modal understanding and generation. We propose Show-o2, a unified architecture that natively supports joint modeling of text, images, and videos. Methodologically, we introduce a 3D causal variational autoencoder to construct a shared latent space, coupled with a dual-path spatial(-temporal) fusion mechanism for efficient cross-modal representation learning. We further design a hybrid decoding architecture, combining an autoregressive language head with a flow head trained via flow matching, and formulate a two-stage joint training recipe that optimizes understanding and generation capabilities within a single framework. Experiments demonstrate that Show-o2 achieves state-of-the-art performance across diverse multimodal understanding and generation benchmarks. The code and models are publicly released.
📝 Abstract
This paper presents improved native unified multimodal models, *i.e.*, Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to learn effectively and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
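To make the dual-head design concrete, below is a minimal NumPy sketch of the idea described above: a shared backbone hidden state feeds both an autoregressive language head (token logits) and a flow head (velocity prediction trained with flow matching). All dimensions, weight names, and the linear heads are invented for illustration and are not the actual Show-o2 implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, VOCAB, LATENT = 16, 32, 8  # toy sizes; the real model is far larger

# Toy stand-in for the language-model backbone: one shared hidden state
# per sequence position.
h = rng.normal(size=(4, HIDDEN))

# Two heads on the same hidden states (hypothetical linear weights):
W_lang = rng.normal(size=(HIDDEN, VOCAB))   # language head -> next-token logits
W_flow = rng.normal(size=(HIDDEN, LATENT))  # flow head -> velocity in the visual latent space

logits = h @ W_lang     # autoregressive text-token prediction
velocity = h @ W_flow   # predicted velocity field for image/video generation

# Flow-matching training target under a linear interpolation path
# x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0.
x0 = rng.normal(size=(4, LATENT))   # noise sample
x1 = rng.normal(size=(4, LATENT))   # clean visual latent (e.g., from the 3D causal VAE)
target_v = x1 - x0
flow_loss = np.mean((velocity - target_v) ** 2)
```

In training, the language head would be supervised with a cross-entropy loss on text tokens while the flow head minimizes this velocity-matching loss, so both objectives back-propagate through the shared backbone.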