Goku: Flow Based Video Generative Foundation Models

πŸ“… 2025-02-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of jointly generating high-fidelity images and videos while maintaining training stability in multimodal spatiotemporal generative modeling. We propose an end-to-end trainable Rectified Flow Transformer architecture tailored for visual generation, systematically introducing the rectified flow paradigm into unified text-to-image and text-to-video modeling. To support this framework, we design a large-scale vision-language alignment data curation pipeline and a distributed streaming training infrastructure specifically optimized for spatiotemporal data. Our approach achieves state-of-the-art performance on three major benchmarks, GenEval (0.76), DPG-Bench (83.65), and VBench (84.85), surpassing prior methods in both generation quality and training robustness. The model demonstrates improved convergence behavior, reduced mode collapse, and enhanced fidelity across spatial and temporal dimensions, establishing a new foundation for unified visual generative modeling.

πŸ“ Abstract
This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.
Problem

Research questions and friction points this paper is trying to address.

Develop joint image-video generation models
Achieve industry-leading visual generation performance
Set new benchmarks in text-to-video tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rectified flow Transformers
Joint image-video generation
Efficient large-scale training
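The rectified flow formulation named above is, at its core, a straight-line velocity regression: samples are linearly interpolated between Gaussian noise and data, and the network is trained to predict the constant velocity along that path. A minimal NumPy sketch of the training objective follows; the `zero_model` and toy batch shapes are illustrative stand-ins, not Goku's actual architecture or data.

```python
import numpy as np

def rectified_flow_loss(velocity_model, data, rng):
    """Mean-squared error between a predicted velocity field and the
    straight-line target of rectified flow.

    x_t = (1 - t) * noise + t * data interpolates linearly from Gaussian
    noise (t = 0) to a data sample (t = 1); the regression target is the
    constant velocity (data - noise) along that straight path.
    """
    noise = rng.standard_normal(data.shape)
    # One timestep per batch element, broadcast over feature dimensions.
    t = rng.uniform(size=(data.shape[0],) + (1,) * (data.ndim - 1))
    x_t = (1.0 - t) * noise + t * data
    target = data - noise
    pred = velocity_model(x_t, t)
    return np.mean((pred - target) ** 2)

# Toy "model" that ignores its inputs and predicts zero velocity.
zero_model = lambda x_t, t: np.zeros_like(x_t)

rng = np.random.default_rng(0)
batch = rng.standard_normal((8, 16))  # stand-in for latent visual tokens
loss = rectified_flow_loss(zero_model, batch, rng)
```

At sampling time the learned velocity field is integrated from noise toward data (e.g. with a few Euler steps), which is what makes the straight-path parameterization attractive for fast generation.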
πŸ”Ž Similar Papers
No similar papers found.
Authors

Shoufa Chen | The University of Hong Kong (HKU) | Computer Vision, Deep Learning
Chongjian Ge | The University of Hong Kong
Yuqi Zhang | Bytedance Inc
Yida Zhang | Bytedance Inc
Fengda Zhu | Monash University | Deep Learning, Reinforcement Learning, Computer Vision
Hao Yang | Bytedance Inc
Hongxiang Hao | Bytedance Inc
Hui Wu | Bytedance Inc
Zhichao Lai | Bytedance Inc
Yifei Hu | Bytedance Inc
Ting-Che Lin | Bytedance Inc
Shilong Zhang | University of Hong Kong | AIGC, Multimodal LLMs
Fu Li | Bytedance Inc
Chuan Li | Bytedance Inc
Xing Wang | Bytedance Inc
Yanghua Peng | ByteDance Inc. | Large Language Models, Machine Learning Systems, GPU Scheduling
Peize Sun | Meta FAIR; HKU | Computer Vision, Deep Learning
Ping Luo | National University of Defense Technology | Distributed Computing
Yi Jiang | Bytedance Inc
Zehuan Yuan | Bytedance Inc. | Computer Vision, Multimedia, Machine Learning
Bingyue Peng | Bytedance | Generative AI
Xiaobing Liu | Bytedance Inc