🤖 AI Summary
This work addresses the high inference cost of large language models (LLMs) caused by contextual redundancy. ARC-Encoder is a general-purpose context compression method that requires no modification to the target LLM: it learns compact, continuous representations of the context that replace the original token embeddings, injected directly at the decoder's input layer. Its core contributions are a unified encoder architecture designed for compatibility across diverse LLM families, continuous representation learning, and a systematic training strategy, which together enable 4x-8x context compression without fine-tuning the target model. Experiments show significant reductions in inference latency and GPU memory consumption on both instruction-tuned and base LLMs, with plug-and-play deployment, achieving state-of-the-art performance on several benchmarks while improving computational efficiency at inference.
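The injection mechanism can be sketched as follows. This is a toy stand-in only: the real ARC-Encoder is a learned neural compressor, not mean pooling, and all names, shapes, and the pooling scheme here are illustrative assumptions. It shows the essential interface, namely that the context's token embeddings are replaced by x-times fewer continuous vectors, which are then prepended to the ordinary prompt embeddings at the decoder's input layer:

```python
import numpy as np

def compress_context(token_embeddings, x=4):
    """Toy compressor: average-pool groups of x token embeddings into one
    continuous vector (stand-in for ARC-Encoder's learned encoder)."""
    n, d = token_embeddings.shape
    n_keep = (n // x) * x  # drop the remainder for simplicity
    return token_embeddings[:n_keep].reshape(-1, x, d).mean(axis=1)

# 512 context tokens, hypothetical model dimension 64
ctx_embeddings = np.random.randn(512, 64)
compressed = compress_context(ctx_embeddings, x=8)  # 8x fewer vectors: (64, 64)

# The compressed vectors replace the context in the decoder's input sequence,
# concatenated with the embeddings of the remaining prompt tokens.
prompt_embeddings = np.random.randn(16, 64)
decoder_input = np.concatenate([compressed, prompt_embeddings], axis=0)  # (80, 64)
```

The decoder then attends over 80 input positions instead of 528, which is where the latency and memory savings come from; in the actual method the compressed representations are trained so the frozen decoder can still condition on the original context's content.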
📝 Abstract
Recent techniques such as retrieval-augmented generation or chain-of-thought reasoning have led to longer contexts and increased inference costs. Context compression techniques can reduce these costs, but the most effective approaches require fine-tuning the target model or even modifying its architecture, which can degrade the model's general abilities when it is not used for this specific purpose. Here we explore an alternative approach: an encoder that compresses the context into continuous representations which replace token embeddings in decoder LLMs. First, we perform a systematic study of training strategies and architecture choices for the encoder. Our findings led to the design of an Adaptable text Representations Compressor, named ARC-Encoder, which outputs $x$-times fewer continuous representations (typically $x\in\{4,8\}$) than text tokens. We evaluate ARC-Encoder across a variety of LLM usage scenarios, ranging from in-context learning to context window extension, on both instruct and base decoders. Results show that ARC-Encoder achieves state-of-the-art performance on several benchmarks while improving computational efficiency at inference. Finally, we demonstrate that our models can be adapted to multiple decoders simultaneously, allowing a single encoder to generalize across different decoder LLMs. This makes ARC-Encoder a flexible and efficient solution for portable encoders that work seamlessly with multiple LLMs. We release the training code at https://github.com/kyutai-labs/ARC-Encoder ; the fine-tuning dataset and pretrained models are available at https://huggingface.co/collections/kyutai/arc-encoders-68ee18787301407d60a57047 .