AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers

📅 2025-11-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In multi-scale autoregressive vision Transformers, the key-value (KV) cache grows rapidly with the number of scales, severely limiting model scalability and generation efficiency. Method: This paper presents the first systematic study of KV caching in multi-scale image generation, proposing an adaptive hierarchical caching strategy. Leveraging cross-scale key-value similarity analysis, it distinguishes between local-detail and global-condensed scales, dynamically identifying cache-demanding layers and prioritizing the caching of critical information. The method jointly optimizes multi-scale modeling, cross-scale similarity measurement, and hierarchical cache allocation within vision Transformers. Results: Experiments demonstrate an 84.83% reduction in KV cache size, a 60.48% decrease in self-attention latency, and batch size scaling to 256 without GPU memory overflow, significantly improving the trade-off between generation throughput and output quality.

📝 Abstract
Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale-based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales, which severely limits scalability. Our systematic investigation reveals that: (1) attending to tokens from local scales contributes significantly to generation quality; (2) allocating a small amount of memory for the coarsest scales, termed condensed scales, stabilizes multi-scale image generation; and (3) strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on these observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency by identifying cache-demanding layers through inter-scale similarity analysis. Compared to vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
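The retention policy the abstract describes, keeping KVs from the coarsest condensed scales plus a window of local scales, can be sketched minimally. This is a hypothetical illustration, not the paper's implementation; `num_condensed` and `local_window` are illustrative parameters:

```python
# Hypothetical sketch of a scale-adaptive KV retention policy in the spirit
# of AMS-KV: always retain the coarsest "condensed" scales, plus a local
# window of the most recently generated scales; evict everything else.

def scales_to_keep(current_scale: int,
                   num_condensed: int = 2,
                   local_window: int = 3) -> list[int]:
    """Return the scale indices whose KV entries are retained when
    generating tokens at `current_scale` (0-indexed, coarse to fine)."""
    # Condensed scales: a small fixed budget for the coarsest scales.
    condensed = set(range(min(num_condensed, current_scale)))
    # Local scales: the most recent scales preceding the current one.
    local = set(range(max(0, current_scale - local_window), current_scale))
    return sorted(condensed | local)
```

For example, at scale 8 with the defaults this keeps scales [0, 1, 5, 6, 7], so the cache footprint stays bounded as the number of scales grows instead of accumulating every prior scale.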
Problem

Research questions and friction points this paper is trying to address.

Optimizing KV caching for multi-scale visual autoregressive transformers
Reducing excessive KV memory growth in next-scale prediction models
Improving computational efficiency through adaptive KV cache policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive KV caching prioritizes condensed and local scales
Optimizes cache utilization through inter-scale similarity analysis
Reduces KV cache usage and self-attention latency significantly
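The inter-scale similarity analysis behind the second bullet can be sketched as follows. This is a hypothetical proxy, assuming layers whose key representations stay similar across consecutive fine scales are "cache-efficient" while dissimilar ones are "cache-demanding"; the function name, threshold, and mean-key heuristic are illustrative, not taken from the paper:

```python
import numpy as np

def layer_is_cache_demanding(keys_by_scale: list[np.ndarray],
                             threshold: float = 0.9) -> bool:
    """Classify one attention layer from its per-scale key matrices
    (each of shape [num_tokens, head_dim]). A layer with weak average
    cosine similarity between mean keys of consecutive scales is
    treated as cache-demanding and would get a full KV cache budget."""
    sims = []
    for a, b in zip(keys_by_scale, keys_by_scale[1:]):
        ma, mb = a.mean(axis=0), b.mean(axis=0)
        sims.append(float(ma @ mb /
                          (np.linalg.norm(ma) * np.linalg.norm(mb) + 1e-8)))
    return float(np.mean(sims)) < threshold
```

Under this sketch, cache-efficient layers (high inter-scale similarity) could reuse or compress KVs from earlier scales, concentrating the memory budget on the cache-demanding layers.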
Boxun Xu
University of California, Santa Barbara
Brain-inspired ML, Computer Architecture, Efficient AI, HW/SW Co-design, Generative AI
Yu Wang
University of California, Santa Barbara
Zihu Wang
University of California, Santa Barbara
Peng Li
University of California, Santa Barbara