CompRank: Efficient LLM Reranking via Token-Level Compression and Decoding-Free Scoring

📅 2026-06-10

🤖 AI Summary

This work addresses the high computational cost and poor scalability of large language model (LLM) rerankers on long candidate lists. The authors propose CompRank, a framework that combines segment-wise token compression with a decoding-free, attention-based scoring mechanism. CompRank also reuses document-side states across queries and candidate orderings, and it trains with a CopyNet-style objective that aligns attention-based document scores with the training supervision, which reduces redundant computation. On seven BEIR datasets, CompRank reaches an average NDCG@10 of 39.2 while retaining only 10.2% of the original document tokens, close to the 39.7 obtained with full-token attention. On TREC-COVID, it yields a 4.9× to 9.5× end-to-end speedup over generation-based listwise reranking.
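For readers unfamiliar with the metric quoted above, here is a minimal sketch of NDCG@10 for a single query. It assumes the linear-gain form of DCG (gain equal to the relevance label); the function names `dcg_at_k` and `ndcg_at_k` are illustrative and do not come from the paper, and the paper's exact evaluation tooling is not stated here. Reported figures such as 39.2 are presumably this value multiplied by 100 and averaged over queries and datasets.

```python
import math


def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))


def ndcg_at_k(ranked_rels, all_rels, k=10):
    """NDCG@k: DCG of the produced ranking divided by the DCG of the ideal ranking.

    ranked_rels: relevance labels in the order the reranker returned the documents.
    all_rels:    relevance labels of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


# A perfect ranking scores 1.0; pushing the only relevant document to rank 2 lowers it.
perfect = ndcg_at_k([1, 0, 0], [1, 0, 0])
swapped = ndcg_at_k([0, 1, 0], [1, 0, 0])
```
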

📝 Abstract

Large language model (LLM) rerankers have become an important component of modern retrieval and retrieval-augmented generation pipelines, but their high computational cost limits their applicability to long candidate lists. In this paper, we propose CompRank, a token-efficient reranking framework that reduces redundant computation by aligning reranker design with the sparsity of ranking signals. CompRank decouples document representations from candidate order and query context, enabling reusable document-side states; applies segment-wise token compression to reduce query-document interaction cost; and introduces a CopyNet-style objective that directly aligns attention-based document scoring with training supervision. Experiments on seven BEIR datasets show that CompRank achieves strong reranking performance while retaining only 10.2% of document tokens, reaching an average NDCG@10 of 39.2 compared with 39.7 under full-token attention. Further scaling experiments on TREC-COVID show that CompRank remains stable when evaluated on candidate lists of up to 500 documents after training on 30-document lists, while achieving 4.9×–9.5× end-to-end speedup over generation-based listwise reranking and approximately 1.3× speedup over the full-token CompRank variant. These results suggest that token-level compression and decoding-free attention scoring provide an effective path toward scalable LLM-based reranking.
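The two ideas named in the abstract, segment-wise compression and decoding-free attention scoring, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: mean pooling stands in for whatever compression the model actually uses, a single query vector with scaled dot-product attention stands in for the model's attention heads, and the names `compress_segments` and `attention_scores` are hypothetical. It is not the authors' implementation.

```python
import math


def compress_segments(token_vecs, segment_len):
    """Replace each run of `segment_len` token vectors with their mean (assumed stand-in)."""
    pooled = []
    for start in range(0, len(token_vecs), segment_len):
        seg = token_vecs[start:start + segment_len]
        dim = len(seg[0])
        pooled.append([sum(v[d] for v in seg) / len(seg) for d in range(dim)])
    return pooled


def attention_scores(query_vec, docs_compressed):
    """Score each document by the attention mass the query places on its tokens.

    No tokens are generated: one softmax over all compressed document tokens,
    then the attention weights are summed per document.
    """
    scale = math.sqrt(len(query_vec))
    logits, owners = [], []
    for doc_id, vecs in enumerate(docs_compressed):
        for v in vecs:
            logits.append(sum(q * x for q, x in zip(query_vec, v)) / scale)
            owners.append(doc_id)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    scores = [0.0] * len(docs_compressed)
    for e, doc_id in zip(exps, owners):
        scores[doc_id] += e / total
    return scores


# Two toy documents of four 2-d token vectors each.
docs = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
]
# Compression does not depend on the query, so it can be computed once and cached.
compressed = [compress_segments(d, segment_len=2) for d in docs]
scores = attention_scores([1.0, 0.0], compressed)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Because the compressed states are computed without reference to the query or to the position of a document in the candidate list, they can be reused across queries, which is the kind of document-side reuse the abstract describes. In this toy version, summing attention mass favours documents with more compressed tokens; a real system would need to handle that, and how CompRank does so is not stated here.
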

Problem

Research questions and friction points this paper is trying to address.

LLM reranking

computational cost

long candidate lists

token efficiency

scalable reranking

Innovation

Methods, ideas, or system contributions that make the work stand out.

token-level compression

decoding-free scoring

LLM reranking