Go with Your Gut: Scaling Confidence for Autoregressive Image Generation

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing test-time scaling (TTS) methods rely on frequent partial decoding and external reward models, making them incompatible with next-token-prediction (NTP) autoregressive (AR) image generation. This paper introduces ScalingAR, the first TTS framework designed specifically for NTP-based AR image generation. Its core innovation is a dual-level scaling mechanism guided solely by intrinsic token entropy as a confidence signal: a Profile Level fuses intrinsic and conditional signals into a calibrated confidence state, while a Policy Level uses that state to adaptively terminate low-confidence token trajectories and dynamically schedule conditioning strength, eliminating the need for early decoding or external rewards. Evaluated on GenEval and TIIF-Bench, ScalingAR improves base models by 12.5% and 15.2% respectively, reduces visual token consumption by 62.0%, and mitigates performance drops by 26.0% in challenging scenarios.
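The entropy-as-confidence signal at the heart of the summary above can be illustrated with a minimal sketch. The function names and the max-entropy normalization are illustrative assumptions, not the paper's exact formulation:

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token distribution.
    Low entropy means the model concentrates mass on few tokens,
    which ScalingAR treats as high confidence."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def confidence(probs):
    """Map entropy to a [0, 1] confidence score by normalizing against
    the maximum possible entropy (a uniform distribution over the vocabulary).
    This normalization is an illustrative choice."""
    max_h = math.log(len(probs))
    return 1.0 - token_entropy(probs) / max_h
```

A sharply peaked distribution such as `[0.97, 0.01, 0.01, 0.01]` yields a confidence near 1, while a uniform distribution yields exactly 0, so the score orders generation steps by how decisive the model was.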

📝 Abstract
Test-time scaling (TTS) has demonstrated remarkable success in enhancing large language models, yet its application to next-token prediction (NTP) autoregressive (AR) image generation remains largely uncharted. Existing TTS approaches for visual AR (VAR), which rely on frequent partial decoding and external reward models, are ill-suited for NTP-based image generation due to the inherent incompleteness of intermediate decoding results. To bridge this gap, we introduce ScalingAR, the first TTS framework specifically designed for NTP-based AR image generation that eliminates the need for early decoding or auxiliary rewards. ScalingAR leverages token entropy as a novel signal in visual token generation and operates at two complementary scaling levels: (i) Profile Level, which streams a calibrated confidence state by fusing intrinsic and conditional signals; and (ii) Policy Level, which utilizes this state to adaptively terminate low-confidence trajectories and dynamically schedule guidance for phase-appropriate conditioning strength. Experiments on both general and compositional benchmarks show that ScalingAR (1) improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, (2) efficiently reduces visual token consumption by 62.0% while outperforming baselines, and (3) successfully enhances robustness, mitigating performance drops by 26.0% in challenging scenarios.
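The two Policy Level behaviors described in the abstract, adaptive termination of low-confidence trajectories and phase-dependent guidance scheduling, can be sketched as follows. The window size, threshold, guidance range, and the early-strong/late-weak schedule direction are all illustrative assumptions, not values from the paper:

```python
def should_terminate(confidences, window=8, threshold=0.3):
    """Stop expanding a trajectory when its recent mean confidence
    stays below a threshold. Window and threshold are hypothetical
    hyperparameters standing in for the paper's calibrated state."""
    if len(confidences) < window:
        return False  # not enough evidence yet
    recent = confidences[-window:]
    return sum(recent) / window < threshold

def guidance_scale(step, total_steps, lo=1.5, hi=7.5):
    """Linearly anneal conditioning strength over the generation phases:
    stronger guidance early (global layout), weaker later (local detail).
    The direction and endpoints are assumptions for illustration."""
    frac = step / max(total_steps - 1, 1)
    return hi + (lo - hi) * frac
```

Under this sketch, a trajectory whose per-token confidences sag is pruned before it consumes its full visual-token budget, which is one plausible way the reported 62.0% reduction in token consumption could arise.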
Problem

Research questions and friction points this paper is trying to address.

Scaling test-time confidence for NTP-based autoregressive image generation
Eliminating the need for early decoding or auxiliary reward models
Improving robustness while reducing visual token consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses token entropy as a novel confidence signal for visual token generation
Profile Level fuses intrinsic and conditional signals into a calibrated confidence state
Policy Level adaptively terminates low-confidence trajectories and dynamically schedules guidance