🤖 AI Summary
In retrieval-augmented generation (RAG), retrieved passages are often lengthy and noisy, frequently exceeding large language model (LLM) input limits; existing compression methods rely on training task-specific models, incurring high computational cost and poor portability. This paper proposes Sentinel, a training-free, lightweight sentence-level compression framework that formulates context filtering as a query-aware attention comprehension task. It leverages the native decoder self-attention signals of a 0.5B-parameter proxy LLM as an unsupervised relevance probe, paired with a lightweight classifier that scores sentence importance, enabling zero-shot cross-model transfer. Evaluated on LongBench, the method achieves up to 5× compression while matching the QA performance of a 7B supervised compression system, substantially reducing computational overhead and deployment complexity.
📝 Abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5× compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code available at: https://github.com/yzhangchuck/Sentinel.
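The core idea, scoring sentences by the attention mass that query tokens direct at them, can be sketched as follows. This is a minimal illustrative toy, not Sentinel's actual implementation: the aggregation scheme and all names are assumptions, and a synthetic row-normalized matrix stands in for real decoder self-attention from a 0.5B proxy LLM.

```python
import numpy as np

def score_sentences(attn, query_rows, sent_spans):
    """Hypothetical relevance probe: average attention mass flowing
    from query-token rows to each sentence's token span."""
    q = attn[query_rows]                      # (n_query_tokens, seq_len)
    return np.array([q[:, s:e].mean() for s, e in sent_spans])

# Toy setup: 10 context tokens (two 5-token sentences) followed by 3 query tokens.
rng = np.random.default_rng(0)
attn = rng.random((13, 13))
attn /= attn.sum(axis=-1, keepdims=True)      # rows sum to 1, like softmax output
attn[10:, 5:10] += 0.5                        # simulate query attending to sentence 2

sent_spans = [(0, 5), (5, 10)]                # token spans of the two sentences
scores = score_sentences(attn, [10, 11, 12], sent_spans)
keep = np.argsort(scores)[::-1][:1].tolist()  # keep the top-1 sentence
print(keep)  # → [1], the sentence the query rows were biased toward
```

In the full system these scores would feed a lightweight classifier rather than a raw top-k cut, and sentences below the relevance threshold would be dropped to reach the target compression ratio.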