🤖 AI Summary
Existing generative pretraining algorithms are constrained by two dominant paradigms, autoregressive models for discrete signals and diffusion models for continuous signals, which struggle to jointly achieve strong multimodal modeling capability and efficient inference, creating scalability bottlenecks. To address this, we propose a novel "inference-first" paradigm that prioritizes scalability in sequence length and refinement steps. We introduce Inductive Moment Matching (IMM) into generative pretraining for the first time, yielding a single-stage, inherently stable generation algorithm that does not require long multi-step sampling. Our method unifies IMM theory, improved diffusion mechanisms, and inference-time temporal modeling within an end-to-end, multimodal, data-driven training framework. Experiments demonstrate that our approach achieves sample quality comparable to state-of-the-art diffusion models, accelerates inference by over 10×, and significantly improves multimodal generation consistency and training stability.
📝 Abstract
Recent years have seen significant advances in foundation models through generative pre-training, yet algorithmic innovation in this space has largely stagnated around autoregressive models for discrete signals and diffusion models for continuous signals. This stagnation creates a bottleneck that prevents us from fully unlocking the potential of rich multi-modal data, which in turn limits progress on multimodal intelligence. We argue that an inference-first perspective, which prioritizes scaling efficiency at inference time across sequence length and refinement steps, can inspire novel generative pre-training algorithms. Using Inductive Moment Matching (IMM) as a concrete example, we demonstrate how addressing limitations in diffusion models' inference process through targeted modifications yields a stable, single-stage algorithm that achieves superior sample quality with over an order of magnitude greater inference efficiency.
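To make "moment matching" concrete, the sketch below implements a kernel maximum mean discrepancy (MMD) estimator, the family of distribution-matching objectives that moment-matching training builds on. This is an illustrative sketch only: the function names, the RBF kernel choice, and the bandwidth are assumptions for exposition, not the paper's actual training objective.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """RBF kernel matrix between sample batches x (n, d) and y (m, d)."""
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y.

    MMD^2 = E[k(x, x')] - 2 E[k(x, y)] + E[k(y, y')].
    It is near zero when both sample sets come from the same distribution
    and grows as the distributions diverge.
    """
    return (rbf_kernel(x, x, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean())

rng = np.random.default_rng(0)
# Same distribution: two batches of 2-D standard normals.
same = mmd_squared(rng.normal(size=(256, 2)), rng.normal(size=(256, 2)))
# Different distributions: standard normal vs. a mean-shifted normal.
diff = mmd_squared(rng.normal(size=(256, 2)),
                   rng.normal(3.0, 1.0, size=(256, 2)))
assert same < diff  # matched distributions score lower
```

In moment-matching training, a loss of this shape is minimized so that the model's few-step samples match a target distribution; the batches above stand in for model samples and targets.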