AttnComp: Attention-Guided Adaptive Context Compression for Retrieval-Augmented Generation

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Retrieval-Augmented Generation (RAG) often suffers from degraded factual accuracy due to irrelevant retrieved content, while existing context compression methods struggle to simultaneously achieve adaptivity, low latency, and cross-document information integration. To address these challenges, the authors propose an attention-guided adaptive context compression framework. The method leverages the self-attention distributions of large language models (LLMs) to drive a Top-P dynamic compression algorithm that performs input-aware, adaptive truncation of the context. In addition, it introduces a holistic relevance assessment mechanism that jointly models semantic consistency across multiple retrieved documents to calibrate response confidence. Experimental results show that the approach substantially improves the compression ratio and reduces inference latency while preserving high factual accuracy, outperforming both state-of-the-art compression baselines and uncompressed RAG systems.

📝 Abstract
Retrieval-augmented generation improves the factual accuracy of Large Language Models (LLMs) by incorporating external context, but often suffers from irrelevant retrieved content that hinders effectiveness. Context compression addresses this issue by filtering out irrelevant information from the context before LLM generation. However, existing methods struggle to adaptively adjust compression rates for different contexts, maintain low latency, and integrate information across multiple documents. To overcome these limitations, we introduce AttnComp, an adaptive, efficient, and context-aware compression framework. By leveraging the attention mechanism of LLMs to identify relevant information, AttnComp employs a Top-P compression algorithm to retain the minimal set of documents whose cumulative attention weights exceed a predefined threshold. In addition to compression, AttnComp estimates response confidence by assessing the overall relevance of the retrieved content, enabling users to gauge response reliability. Experiments demonstrate that AttnComp outperforms existing compression methods and uncompressed baselines, achieving higher accuracy with substantial compression rates and lower latency.
Problem

Research questions and friction points this paper is trying to address.

Filtering irrelevant information from retrieved context for LLMs
Adaptively adjusting compression rates across different contexts
Maintaining low latency while integrating multi-document information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses attention mechanism to identify relevant information
Employs Top-P algorithm for adaptive document compression
Estimates response confidence through content relevance assessment
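The Top-P document selection described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: it assumes each retrieved document has already been reduced to a single aggregate attention score, keeps the most-attended documents until their cumulative normalized attention mass reaches a threshold, and returns the retained document indices.

```python
def top_p_compress(attention_weights, threshold=0.8):
    """Keep the minimal set of documents whose cumulative normalized
    attention mass reaches `threshold` (a Top-P style selection).

    attention_weights: one aggregate attention score per retrieved document.
    Returns the indices of retained documents, in original order.
    """
    total = sum(attention_weights)
    # Rank documents from most- to least-attended.
    ranked = sorted(range(len(attention_weights)),
                    key=lambda i: attention_weights[i], reverse=True)
    keep, cum = [], 0.0
    for i in ranked:
        keep.append(i)
        cum += attention_weights[i] / total
        if cum >= threshold:  # minimal prefix covering the threshold
            break
    return sorted(keep)

# Example: the two most-attended documents cover 80% of attention mass.
selected = top_p_compress([0.5, 0.3, 0.1, 0.1], threshold=0.7)
```

Because the cutoff depends on the attention distribution of each query-context pair, the number of retained documents adapts per input rather than being a fixed compression rate.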
Lvzhou Luo
Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing 100190, China; State Key Lab of AI Safety, Beijing 100190, China; University of Chinese Academy of Sciences, CAS, Beijing 100049, China
Yixuan Cao
Shenzhen University
Ping Luo
National University of Defense Technology