End-to-End Context Compression at Scale

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory bottleneck in long-context language model inference, which arises because the KV cache grows with sequence length. Existing KV cache compression methods either degrade generation quality substantially or require considerable time and compute to compress a single long prompt, and many are incompatible with production inference engines. To overcome these limitations, the work proposes Latent Context Language Models (LCLMs), a family of end-to-end trained encoder-decoder compressors (0.6B-parameter encoder, 4B-parameter decoder) that map long input sequences to shorter sequences of latent embeddings. Each model is continually pre-trained on over 350 billion tokens at a compression ratio of 1:4, 1:8, or 1:16. LCLMs improve the Pareto frontier across general-task performance, compression speed, and peak memory usage relative to prior KV cache compression approaches. They also let an agent skim a compressed long context and expand relevant segments on demand, which makes them well suited to long-horizon agent applications.
📝 Abstract
Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
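To make the memory argument concrete, the following is a rough back-of-the-envelope sketch, not the paper's method or measurements. It assumes that each latent embedding occupies one position in the decoder, so the decoder's KV cache scales with the number of latents rather than the number of input tokens. The decoder shape (36 layers, 8 KV heads, head dimension 128, 2-byte values) and the 128,000-token context are hypothetical values chosen only for illustration, and the function names are invented here. Encoder memory and generated tokens are ignored.

```python
def kv_cache_bytes(num_positions, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector per head, per layer, per position."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_positions


def compressed_length(num_tokens, ratio):
    """Number of latent positions at compression ratio 1:ratio (rounded up)."""
    return -(-num_tokens // ratio)  # ceiling division


# Hypothetical decoder shape and context length (assumptions, not taken from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128
CONTEXT_TOKENS = 128_000

# ratio 1 is the uncompressed baseline; 4, 8, 16 are the ratios named in the abstract.
for ratio in (1, 4, 8, 16):
    positions = compressed_length(CONTEXT_TOKENS, ratio)
    gib = kv_cache_bytes(positions, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"1:{ratio:<2}  positions={positions:>7}  approx KV cache={gib:.2f} GiB")
```

Under these assumptions the cache size is linear in the number of positions, so a 1:16 ratio shrinks the decoder's context cache by roughly a factor of 16.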
Problem

Research questions and friction points this paper is trying to address.

long-context inference
KV cache compression
memory bottleneck
encoder-decoder compression
context window limitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Context Language Models
KV cache compression
encoder-decoder compression
long-context inference
Pareto frontier