Still: Amortized KV Cache Compaction in a Single Forward Pass

📅 2026-06-05

🤖 AI Summary

This work addresses the memory bottleneck that the KV cache imposes on long-context inference. A deployable compactor must be lightweight enough to call during inference, expressive enough to preserve context under a tight budget, and reusable across a trajectory. Existing methods satisfy only part of this requirement: selection methods are lightweight but limited to subsets of the original cache, while synthesis methods are expressive but rely on per-context optimization. The authors propose Still, a small per-layer Perceiver trained once against a frozen base model, which synthesizes compact keys and values in a single forward pass without per-context optimization. Because compaction is a forward pass, it can also be applied iteratively over a long trajectory. Evaluated on Qwen and Gemma models at compression ratios from 8× to 200× and context lengths from 8k to 128k, Still exceeds the strongest baseline by 8–22 points on the long-context RULER grid, preserves most of the full-context gain on HELMET, and wins a pairwise LongBench summarization comparison against KV-Distill.
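To give a sense of scale for the memory bottleneck and the compression ratios mentioned above, the following back-of-the-envelope sketch estimates KV cache size before and after compaction. The model configuration (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical example and is not taken from the paper, and "128k" is assumed here to mean 128 × 1024 tokens.

```python
def kv_cache_bytes(seq_len, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to store keys plus values across all layers.

    The default configuration is a hypothetical example, not the paper's.
    """
    # Factor of 2 accounts for storing both a key and a value per position.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


full_len = 128 * 1024  # assumed meaning of "128k" tokens
full_bytes = kv_cache_bytes(full_len)

for ratio in (8, 200):
    compact_len = full_len // ratio  # number of compact KV slots kept
    compact_bytes = kv_cache_bytes(compact_len)
    print(
        f"{ratio:>3}x: {compact_len} slots, "
        f"{compact_bytes / 2**20:.1f} MiB "
        f"(full cache: {full_bytes / 2**30:.1f} GiB)"
    )
```

Under these assumed settings the uncompressed cache for a single sequence is 16 GiB, which is why the compression ratio, rather than the model weights, can dominate the memory budget at long context.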

📝 Abstract

The KV cache is the memory bottleneck of long-horizon language model deployment. Practically, a deployable compactor must be lightweight enough to call during inference, expressive enough to preserve context under constraint, and reusable across a trajectory. Existing compaction methods satisfy only part of this requirement: selection methods are lightweight but subset-bound, while synthesis methods are expressive but rely on per-context optimization. Here we introduce Still, a small per-layer Perceiver trained once against a frozen base model that produces compact keys and values in a single forward pass. On Qwen and Gemma models, Still occupies the favorable side of the speed--quality frontier across compression ratios from $8\times$ to $200\times$ and context lengths from $8$k to $128$k. On the long-context RULER grid, Still exceeds the strongest baseline by 8--22 points. The same compact cache also supports free-form summarization, preserving most of the full-context gain on HELMET and winning a pairwise LongBench summarization comparison against KV-Distill. Because compaction is a forward pass, Still can be applied iteratively, entering a long-horizon regime unavailable to per-context methods. We show that amortization makes long-context cache compaction tractable, and synthesis makes its compact state useful at extreme compression.
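The abstract does not spell out the architecture, so the following is only a minimal sketch of the general Perceiver-style idea, not the authors' implementation: a fixed set of latent queries cross-attends over the cached keys and values of one layer and emits a shorter set of synthesized keys and values. The names `compact_kv`, `latents`, `w_k`, and `w_v` are hypothetical, the weights are random rather than trained, and the iterative step at the end is one assumed way of reapplying compaction.

```python
import math
import random


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply nested-list matrices: (n x k) @ (k x m)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def compact_kv(keys, values, latents, w_k, w_v):
    """One cross-attention pass: m latent queries pool n cached (key, value) pairs.

    Hypothetical sketch: each output slot is a weighted mixture of the whole
    cache followed by a linear projection, so it need not equal any input row.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    attn = [
        softmax([scale * sum(q * k for q, k in zip(lat, key)) for key in keys])
        for lat in latents
    ]
    compact_k = matmul(matmul(attn, keys), w_k)
    compact_v = matmul(matmul(attn, values), w_v)
    return compact_k, compact_v, attn


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, ratio, d = 64, 8, 4          # toy sizes: 64 cached tokens, 8x compression
m = n // ratio                  # number of compact slots
keys, values = rand_matrix(rng, n, d), rand_matrix(rng, n, d)
latents = rand_matrix(rng, m, d)                    # would be learned; random here
w_k, w_v = rand_matrix(rng, d, d), rand_matrix(rng, d, d)

compact_k, compact_v, attn = compact_kv(keys, values, latents, w_k, w_v)
print(len(keys), "->", len(compact_k), "slots")

# Assumed iterative use: append new tokens to the compact cache, compact again.
new_k, new_v = rand_matrix(rng, n, d), rand_matrix(rng, n, d)
compact_k2, compact_v2, _ = compact_kv(
    compact_k + new_k, compact_v + new_v, latents, w_k, w_v
)
```

The sketch illustrates the two properties the abstract contrasts: the output is synthesized rather than selected from the input, and it costs one forward pass with no per-context optimization, which is what makes repeated application feasible.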

Problem

Research questions and friction points this paper is trying to address.

KV cache

compaction

long-context

memory bottleneck

language model deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache compaction

amortized compression

Perceiver