TensorAR: Refinement is All You Need in Autoregressive Image Generation

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive (AR) image generation is limited in quality because it relies on single-step causal decoding and has no mechanism for post-hoc refinement. To address this, we propose *next-tensor prediction*, a paradigm that iteratively refines previously generated content by predicting overlapping discrete tensor blocks with a sliding window. We introduce a codebook-index-driven discrete noising scheme and strictly causal masking during training to prevent information leakage. The approach is implemented as a plug-and-play module that is compatible with mainstream AR architectures, including LlamaGEN and RAR, and operates on top of VQ tokenizers with a learned token-to-tensor mapping. Quantitative evaluation shows substantial improvements: FID decreases by 12.3%, CLIP Score increases by 4.1, and human preference scores rise significantly. Generated images exhibit richer fine-grained details and enhanced structural consistency.

📝 Abstract
Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.
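
To make the sliding-window idea concrete, here is a minimal sketch of how overlapping next-tensor targets could be built from a flat token sequence. It is an illustration under assumptions, not the paper's implementation: the function names, the `PAD` index, and the choice to start each window at the current position are all hypothetical.

```python
from typing import List, Sequence

PAD = -1  # hypothetical padding index for window slots past the sequence end


def next_tensor_targets(tokens: Sequence[int], window: int) -> List[List[int]]:
    """Return one target window per step: the `window` tokens starting at step i.

    Consecutive windows overlap in `window - 1` positions, so most tokens are
    predicted several times (assumed indexing; the paper may differ).
    """
    n = len(tokens)
    targets = []
    for i in range(n):
        win = [tokens[j] if j < n else PAD for j in range(i, i + window)]
        targets.append(win)
    return targets


def count_predictions(n: int, window: int) -> List[int]:
    """Count how many windows cover each token position in a length-n sequence."""
    return [min(pos + 1, window) for pos in range(n)]


if __name__ == "__main__":
    print(next_tensor_targets([5, 9, 2, 7], window=2))
    print(count_predictions(4, window=2))
```

In this sketch every position except the first is covered by more than one window, which is what gives a model later chances to revise an earlier prediction.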

Problem

Research questions and friction points this paper is trying to address.

AR models lack refinement of previous predictions
Improving AR image generation via next-tensor prediction
Preventing information leakage with discrete tensor noising

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates AR as next-tensor prediction
Uses sliding windows for iterative refinement
Proposes discrete tensor noising scheme
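
The discrete tensor noising idea can be sketched as follows. The abstract only says that input tokens are perturbed with codebook-indexed noise, so the details here are assumptions: the name `noise_tokens`, the uniform sampling over the codebook, and the per-token replacement probability `noise_prob` are hypothetical choices for illustration.

```python
import random
from typing import List, Sequence


def noise_tokens(
    tokens: Sequence[int],
    codebook_size: int,
    noise_prob: float,
    rng: random.Random,
) -> List[int]:
    """Replace each token with a random codebook index with probability noise_prob.

    Assumed scheme: the noise stays discrete because a corrupted token is
    simply another valid index in [0, codebook_size).
    """
    noised = []
    for t in tokens:
        if rng.random() < noise_prob:
            noised.append(rng.randrange(codebook_size))  # uniform codebook index
        else:
            noised.append(t)  # keep the original token
    return noised


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [12, 7, 301, 44, 44, 9]
    print(noise_tokens(clean, codebook_size=1024, noise_prob=0.3, rng=rng))
```

Corrupting the input window this way means the model cannot simply copy tokens it is asked to predict again, which is the leakage the noising scheme is meant to prevent.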