🤖 AI Summary
Autoregressive visual generation models face a fundamental trade-off: discrete token modeling incurs information loss and reduced fidelity, while continuous token modeling suffers from high-dimensional density estimation challenges and out-of-distribution artifacts. This paper introduces DisCon, the first framework to treat discrete tokens as *conditional signals*—rather than generation targets—for continuous representation synthesis. DisCon models the conditional probability distribution of a continuous latent space given discrete token constraints, thereby preserving image information integrity while ensuring generation stability. By decoupling discrete semantics from continuous reconstruction, the approach avoids vector quantization distortion and mitigates generalization risks inherent in purely continuous autoregressive modeling. Evaluated on ImageNet 256×256, DisCon achieves a state-of-the-art gFID of 1.38, substantially outperforming existing autoregressive methods. This work establishes a novel paradigm for high-fidelity autoregressive image generation.
📝 Abstract
Recent advances in large language models (LLMs) have spurred interest in encoding images as discrete tokens and leveraging autoregressive (AR) frameworks for visual generation. However, the quantization process in AR-based visual generation models inherently introduces information loss that degrades image fidelity. To mitigate this limitation, recent studies have explored autoregressively predicting continuous tokens. Unlike discrete tokens, which reside in a structured and bounded space, continuous representations exist in an unbounded, high-dimensional space, making density estimation more challenging and increasing the risk of generating out-of-distribution artifacts. Motivated by these observations, this work introduces DisCon (Discrete-Conditioned Continuous Autoregressive Model), a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets. By modeling the conditional probability of continuous representations conditioned on discrete tokens, DisCon circumvents the optimization challenges of continuous token modeling while avoiding the information loss caused by quantization. DisCon achieves a gFID score of 1.38 on ImageNet 256×256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin.
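The core factorization — sample a discrete token, then sample a continuous representation conditioned on it — can be illustrated with a deliberately simplified sketch. Everything below is hypothetical: DisCon uses learned tokenizers and networks, whereas this toy uses a fixed 1-D codebook and a Gaussian residual, purely to show why the conditional continuous sample retains detail that quantization alone discards.

```python
import random

# Hypothetical 1-D codebook; a real tokenizer learns a high-dimensional one.
CODEBOOK = [-1.0, 0.0, 1.0]

def quantize(x):
    """Discrete token = index of the nearest codebook entry (the lossy step)."""
    return min(range(len(CODEBOOK)), key=lambda k: abs(x - CODEBOOK[k]))

def sample_continuous_given_token(k, noise_std=0.1):
    """Continuous latent conditioned on the discrete token: here, the token's
    codebook center plus a small Gaussian residual. In DisCon this conditional
    p(continuous | discrete) is a learned model, not a fixed Gaussian."""
    return CODEBOOK[k] + random.gauss(0.0, noise_std)

# Purely discrete decoding collapses x to its codebook center; the
# discrete-conditioned continuous sample stays bounded near that center
# (stable) while still carrying sub-quantization variation (high fidelity).
x = 0.83
k = quantize(x)                      # discrete token: index 2 (center 1.0)
x_hat = sample_continuous_given_token(k)
print(k, round(x_hat, 3))
```

The point of the toy is the division of labor: the discrete token bounds where the continuous sample can land (avoiding out-of-distribution drift), while the continuous conditional restores the information the quantizer threw away.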