🤖 AI Summary
This work addresses the limited scalability of native unified multimodal models for cross-modal understanding and generation. We propose Show-o2, a unified architecture that natively supports joint modeling of text, images, and videos. Methodologically, we introduce a 3D causal variational autoencoder to construct a shared latent space, coupled with a dual-path spatial(-temporal) fusion mechanism for efficient cross-modal representation learning. We further design a hybrid decoding architecture, combining an autoregressive language head with a flow head trained via flow matching, and formulate a two-stage joint training recipe that optimizes understanding and generation capabilities within a single framework. Experiments demonstrate that Show-o2 achieves state-of-the-art performance across diverse multimodal understanding and generation benchmarks. The code and models are publicly released.
📝 Abstract
This paper presents improved native unified multimodal models, *i.e.*, Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to learn effectively and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
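To make the dual-head design concrete, below is a minimal NumPy sketch of the idea described above: a shared backbone hidden state feeds both an autoregressive language head (token logits) and a flow head (velocity prediction trained with flow matching). All dimensions, weight names, and the linear heads are invented for illustration and are not the actual Show-o2 implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, VOCAB, LATENT = 16, 32, 8  # toy sizes; the real model is far larger

# Toy stand-in for the language-model backbone: one shared hidden state
# per sequence position.
h = rng.normal(size=(4, HIDDEN))

# Two heads on the same hidden states (hypothetical linear weights):
W_lang = rng.normal(size=(HIDDEN, VOCAB))   # language head -> next-token logits
W_flow = rng.normal(size=(HIDDEN, LATENT))  # flow head -> velocity in the visual latent space

logits = h @ W_lang     # autoregressive text-token prediction
velocity = h @ W_flow   # predicted velocity field for image/video generation

# Flow-matching training target under a linear interpolation path
# x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0.
x0 = rng.normal(size=(4, LATENT))   # noise sample
x1 = rng.normal(size=(4, LATENT))   # clean visual latent (e.g., from the 3D causal VAE)
target_v = x1 - x0
flow_loss = np.mean((velocity - target_v) ** 2)
```

In training, the language head would be supervised with a cross-entropy loss on text tokens while the flow head minimizes this velocity-matching loss, so both objectives back-propagate through the shared backbone.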