Goku: Flow Based Video Generative Foundation Models

πŸ“… 2025-02-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of jointly generating high-fidelity images and videos while maintaining training stability in multimodal spatiotemporal generative modeling. We propose an end-to-end trainable Rectified Flow Transformer architecture tailored for visual generation, systematically introducing the rectified flow paradigm into unified text-to-image and text-to-video modeling. To support this framework, we design a large-scale vision-language alignment data curation pipeline and a distributed streaming training infrastructure specifically optimized for spatiotemporal data. Our approach achieves state-of-the-art performance on three major benchmarks, GenEval (0.76), DPG-Bench (83.65), and VBench (84.85), surpassing prior methods in both generation quality and training robustness. The model demonstrates improved convergence behavior, reduced mode collapse, and enhanced fidelity across spatial and temporal dimensions, establishing a new foundation for unified visual generative modeling.

πŸ“ Abstract
This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.
Problem

Research questions and friction points this paper is trying to address.

Develop joint image-video generation models
Achieve industry-leading visual generation performance
Set new benchmarks in text-to-video tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rectified flow Transformers
Joint image-video generation
Efficient large-scale training
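The rectified flow formulation named above is, at its core, a straight-line velocity regression: samples are linearly interpolated between Gaussian noise and data, and the network is trained to predict the constant velocity along that path. A minimal NumPy sketch of the training objective follows; the `zero_model` and toy batch shapes are illustrative stand-ins, not Goku's actual architecture or data.

```python
import numpy as np

def rectified_flow_loss(velocity_model, data, rng):
    """Mean-squared error between a predicted velocity field and the
    straight-line target of rectified flow.

    x_t = (1 - t) * noise + t * data interpolates linearly from Gaussian
    noise (t = 0) to a data sample (t = 1); the regression target is the
    constant velocity (data - noise) along that straight path.
    """
    noise = rng.standard_normal(data.shape)
    # One timestep per batch element, broadcast over feature dimensions.
    t = rng.uniform(size=(data.shape[0],) + (1,) * (data.ndim - 1))
    x_t = (1.0 - t) * noise + t * data
    target = data - noise
    pred = velocity_model(x_t, t)
    return np.mean((pred - target) ** 2)

# Toy "model" that ignores its inputs and predicts zero velocity.
zero_model = lambda x_t, t: np.zeros_like(x_t)

rng = np.random.default_rng(0)
batch = rng.standard_normal((8, 16))  # stand-in for latent visual tokens
loss = rectified_flow_loss(zero_model, batch, rng)
```

At sampling time the learned velocity field is integrated from noise toward data (e.g. with a few Euler steps), which is what makes the straight-path parameterization attractive for fast generation.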
πŸ”Ž Similar Papers
No similar papers found.
Authors

Shoufa Chen | The University of Hong Kong (HKU) | Computer Vision, Deep Learning
Chongjian Ge | The University of Hong Kong
Yuqi Zhang | Bytedance Inc
Yida Zhang | Bytedance Inc
Fengda Zhu | Monash University | Deep Learning, Reinforcement Learning, Computer Vision
Hao Yang | Bytedance Inc
Hongxiang Hao | Bytedance Inc
Hui Wu | Bytedance Inc
Zhichao Lai | Bytedance Inc
Yifei Hu | Bytedance Inc
Ting-Che Lin | Bytedance Inc
Shilong Zhang | University of Hong Kong | AIGC, Multimodal LLMs
Fu Li | Bytedance Inc
Chuan Li | Bytedance Inc
Xing Wang | Bytedance Inc
Yanghua Peng | ByteDance Inc. | Large Language Models, Machine Learning Systems, GPU Scheduling
Peize Sun | Meta FAIR; HKU | Computer Vision, Deep Learning
Ping Luo | National University of Defense Technology | Distributed Computing
Yi Jiang | Bytedance Inc
Zehuan Yuan | Bytedance Inc. | Computer Vision, Multimedia, Machine Learning
Bingyue Peng | Bytedance | Generative AI
Xiaobing Liu | Bytedance Inc