🤖 AI Summary
This work addresses the high inference latency of visual autoregressive models in high-resolution image generation. Existing acceleration methods rely on heuristic pruning strategies that identify redundant tokens imprecisely and generalize poorly across prompts. The authors propose Latent Discrepancy (LD), a metric that dynamically quantifies a token's contribution by its impact on pixel generation, bringing latent-space signals into redundancy assessment. Built upon LD, they develop LD-Pruning, a unified, training-free pruning framework that integrates decoding-free region selection and adaptive unconditional-branch skipping, the latter informed by how the discrepancy between the conditional and unconditional branches converges under classifier-free guidance (CFG). Evaluated on the Infinity-8B model, LD-Pruning achieves up to a 2.35× speedup while preserving generation quality.
📝 Abstract
Visual Autoregressive (VAR) models deliver high-quality image generation but suffer from significant inference latency at high resolutions. Recent acceleration approaches mostly rely on heuristic measures computed from layer features to prune tokens. Such heuristics are sensitive to complex contextual semantics, leading to inaccurate identification of redundant computation and poor adaptability across prompts. We rethink redundancy in VAR from the perspective of its impact on pixel-space generation and introduce Latent Discrepancy, a unified metric that quantifies a token's contribution by measuring the change in model states during generation. Our analysis shows that redundancy is identified more accurately when guided by image-latent or pixel-space signals. We further observe that, under classifier-free guidance (CFG), the discrepancy between the conditional and unconditional branches converges at rates that vary substantially across prompts. Based on these findings, we propose LD-Pruning (Latent Discrepancy Pruning), a training-free framework that removes redundancy via latent discrepancy by integrating decoding-free region selection and adaptive unconditional-branch skipping. Extensive experiments show that LD-Pruning substantially reduces inference latency while maintaining high generation quality, achieving up to a 2.35× speedup on Infinity-8B.
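To make the idea of adaptive unconditional-branch skipping concrete, the following is a minimal sketch and not the authors' implementation. It assumes one common CFG formulation (unconditional output plus a scaled difference) and uses a relative L2 distance between branch outputs as a stand-in for the discrepancy signal; the abstract does not give the exact definition of Latent Discrepancy. The function names, the `threshold`, and the `patience` parameter are all hypothetical.

```python
import math


def cfg_combine(cond, uncond, scale):
    """One common CFG formulation: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def relative_discrepancy(cond, uncond, eps=1e-8):
    """Hypothetical stand-in metric: ||cond - uncond|| / ||cond|| (L2 norms).

    This is NOT the paper's Latent Discrepancy, only an illustrative proxy.
    """
    diff = math.sqrt(sum((c - u) ** 2 for c, u in zip(cond, uncond)))
    norm = math.sqrt(sum(c * c for c in cond))
    return diff / (norm + eps)


def should_skip_uncond(history, threshold=0.05, patience=2):
    """Decide per prompt whether the unconditional branch can be skipped.

    Returns True once the last `patience` recorded discrepancies are all
    below `threshold`, i.e. the two branches appear to have converged.
    Both parameters are assumptions chosen for illustration.
    """
    if len(history) < patience:
        return False
    return all(d < threshold for d in history[-patience:])


# Usage: track the discrepancy across generation steps for one prompt.
history = []
steps = [
    ([1.0, 2.0], [0.0, 1.0]),    # branches far apart
    ([1.0, 2.0], [0.98, 1.97]),  # branches nearly identical
    ([1.0, 2.0], [0.99, 1.99]),  # still nearly identical
]
for cond, uncond in steps:
    history.append(relative_discrepancy(cond, uncond))
skip_from_now_on = should_skip_uncond(history)
```

Because the decision depends on the observed discrepancy history rather than on a fixed step index, the skip point can differ from prompt to prompt, which is the behavior the abstract's observation motivates. What the model substitutes for the skipped branch is not specified here and is left out of the sketch.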