🤖 AI Summary
In retrieval-augmented generation (RAG), retrieved passages are often lengthy and noisy, frequently exceeding large language model (LLM) input limits; existing compression methods rely on training task-specific models, incurring high computational cost and poor portability. This paper proposes Sentinel, a training-free, lightweight sentence-level compression framework that formulates context filtering as a query-aware attention comprehension task. It leverages the native decoder self-attention signals of a 0.5B-parameter proxy LLM as an unsupervised relevance probe, paired with a lightweight classifier that scores sentence importance, enabling zero-shot cross-model transfer. Evaluated on LongBench, the method achieves up to 5× compression while matching the QA performance of a 7B supervised compression system, substantially reducing computational overhead and deployment complexity.
📝 Abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5× compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code available at: https://github.com/yzhangchuck/Sentinel.
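The core idea, scoring sentences by the attention mass that query tokens direct at them, can be sketched as follows. This is a minimal illustrative toy, not Sentinel's actual implementation: the aggregation scheme and all names are assumptions, and a synthetic row-normalized matrix stands in for real decoder self-attention from a 0.5B proxy LLM.

```python
import numpy as np

def score_sentences(attn, query_rows, sent_spans):
    """Hypothetical relevance probe: average attention mass flowing
    from query-token rows to each sentence's token span."""
    q = attn[query_rows]                      # (n_query_tokens, seq_len)
    return np.array([q[:, s:e].mean() for s, e in sent_spans])

# Toy setup: 10 context tokens (two 5-token sentences) followed by 3 query tokens.
rng = np.random.default_rng(0)
attn = rng.random((13, 13))
attn /= attn.sum(axis=-1, keepdims=True)      # rows sum to 1, like softmax output
attn[10:, 5:10] += 0.5                        # simulate query attending to sentence 2

sent_spans = [(0, 5), (5, 10)]                # token spans of the two sentences
scores = score_sentences(attn, [10, 11, 12], sent_spans)
keep = np.argsort(scores)[::-1][:1].tolist()  # keep the top-1 sentence
print(keep)  # → [1], the sentence the query rows were biased toward
```

In the full system these scores would feed a lightweight classifier rather than a raw top-k cut, and sentences below the relevance threshold would be dropped to reach the target compression ratio.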