AI Summary
This work addresses the limited expressiveness and poor generalization inherent in the residual-prediction paradigm of visual autoregressive modeling. We propose FlexVAR, a flexible, residual-free, fully pixel-supervised autoregressive image generation framework. Its core innovation is abandoning residual modeling entirely: instead, it performs block-wise sequential autoregressive prediction supervised directly by raw pixel values as ground truth, enabling arbitrary resolutions, aspect ratios, and generation step counts, while natively supporting diverse editing tasks. Trained exclusively on images ≤256×256, FlexVAR achieves zero-shot generalization to higher resolutions (e.g., 512×512). On ImageNet 256×256, our 1.0B-parameter model attains an FID of 2.08, substantially outperforming prior autoregressive models (AiM/VAR) and diffusion-based approaches (LDM/DiT). Remarkably, its zero-shot performance at 512×512 matches that of a fully supervised 2.3B-parameter VAR model trained at that resolution.
Abstract
This work challenges the residual-prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR performs autoregressive learning with ground-truth prediction, enabling each step to independently produce plausible images. This simple, intuitive approach learns visual distributions quickly and makes the generation process more flexible and adaptable. Trained solely on low-resolution images ($\leq$ 256px), FlexVAR can: (1) generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images; (2) support various image-to-image tasks, including image refinement, in/out-painting, and image expansion; (3) adapt to various numbers of autoregressive steps, allowing faster inference with fewer steps or enhanced image quality with more steps. Our 1.0B model outperforms its VAR counterpart on the ImageNet 256$\times$256 benchmark. Moreover, when the generation process is transferred zero-shot to 13 steps, performance further improves to 2.08 FID, outperforming the state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and the popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When our 1.0B model is transferred zero-shot to the ImageNet 512$\times$512 benchmark, FlexVAR achieves results competitive with the VAR 2.3B model, a fully supervised model trained at 512$\times$512 resolution.
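The core distinction between the residual prediction used by prior scale-wise models and the ground-truth prediction described above can be sketched as follows. This is a minimal illustrative sketch with hypothetical helpers (`downsample`, `residual_targets`, `ground_truth_targets`), not the paper's actual tokenizer or training code; nearest-neighbor resizing stands in for whatever rescaling the real model uses.

```python
import numpy as np

def downsample(img, size):
    # Nearest-neighbor downsample to (size, size); a stand-in for the
    # tokenizer's rescaling in the real pipeline.
    h, w = img.shape[:2]
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return img[np.ix_(ys, xs)]

def residual_targets(img, scales):
    # Residual paradigm (VAR-style): each step's target is the difference
    # between the ground truth at the current scale and the upsampled
    # reconstruction accumulated so far. Intermediate targets are not
    # images themselves, only corrections.
    targets = []
    recon = np.zeros_like(downsample(img, scales[0]), dtype=float)
    for s in scales:
        gt = downsample(img, s).astype(float)
        f = s // recon.shape[0]
        up = np.kron(recon, np.ones((f, f)))  # crude upsample of running recon
        targets.append(gt - up)
        recon = up + targets[-1]
    return targets

def ground_truth_targets(img, scales):
    # Ground-truth paradigm (as FlexVAR is described): every step is
    # supervised directly by the ground-truth image at that scale, so each
    # step independently yields a plausible image.
    return [downsample(img, s).astype(float) for s in scales]
```

Because every `ground_truth_targets` entry is a complete image, the number and placement of scales can be changed at inference time (fewer steps for speed, more for quality), which is what makes the step count flexible; the residual targets, by contrast, are only meaningful relative to the fixed scale schedule they were computed against.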