miniReranker: Efficient Multimodal Reranking through Visual Cache Reuse and Interaction Sparsity

📅 2026-06-09
🤖 AI Summary
This work addresses the substantial redundant computation in multimodal reranking, where the existing query-first and document-first input formats are poorly aligned with both VQA-style prompting and efficient cache reuse. The paper introduces a vision-first reranking formulation that improves cache reuse and reranking performance, then reduces the remaining cost in three ways: an early-exit mechanism, cross-segment attention restricted to a narrow band of layers, and embedder-guided visual token pruning. In high-reuse scenarios, the proposed method reduces single-query reranking runtime to less than 1% of the dense implementation's while preserving over 96% of the dense model's performance.
📝 Abstract
Multimodal large language models (MLLMs) have recently shown strong potential as point-wise rerankers by directly modeling query-document relevance through next-token prediction. However, point-wise reranking suffers from substantial repeated computation across query-document pairs, while the causal structure of transformers allows only prefix segments to be reused via pre-caching. To address the misalignment of existing query-first and document-first formats with both VQA-style prompting and computation-aware reuse, we propose a vision-first formulation that improves both cache reuse efficiency and reranking performance. However, the remaining cost is still considerable and stems from three main sources: (1) model depth, for which we reduce active parameters via early exit; (2) cross-segment attention, which we restrict to a narrow interaction band across a few layers; and (3) visual tokens, where we reduce the number of tokens via embedder-guided pruning. Together, these designs form miniReranker, which reduces reranking runtime to <1% of the dense implementation under high-reuse settings for a single query, while preserving >96% of the dense model performance.
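The abstract's claim that causal attention lets only a prefix segment be reused can be illustrated with a toy example. The sketch below is not the paper's implementation: it is a single-head, single-layer attention in NumPy with random weights, and every name in it (`causal_attention`, `rerank_suffix`, `full_forward`, the token counts) is a hypothetical stand-in. It places the visual (document) tokens first, caches their keys and values once, and checks that the outputs for the query tokens match a full forward pass for two different queries.

```python
import numpy as np

def causal_attention(q, k, v, offset=0):
    """Single-head causal attention; query row i sits at absolute position offset + i."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    q_pos = np.arange(q.shape[0])[:, None] + offset
    k_pos = np.arange(k.shape[0])[None, :]
    # Each position may only attend to itself and earlier positions.
    scores = np.where(k_pos <= q_pos, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

visual = rng.standard_normal((6, d))   # toy stand-in for visual document tokens
query_a = rng.standard_normal((3, d))  # toy stand-ins for two text queries
query_b = rng.standard_normal((4, d))

# Prefix keys/values depend only on the visual tokens, so they are computed once.
k_cache, v_cache = visual @ Wk, visual @ Wv

def rerank_suffix(query):
    """Process only the query tokens, reusing the cached visual keys/values."""
    k = np.concatenate([k_cache, query @ Wk])
    v = np.concatenate([v_cache, query @ Wv])
    return causal_attention(query @ Wq, k, v, offset=len(visual))

def full_forward(query):
    """Reference: recompute everything from scratch, keep the query rows."""
    x = np.concatenate([visual, query])
    return causal_attention(x @ Wq, x @ Wk, x @ Wv)[len(visual):]

cached_matches_full = all(
    np.allclose(rerank_suffix(q), full_forward(q)) for q in (query_a, query_b)
)
print(cached_matches_full)
```

The same argument extends layer by layer in a deeper causal model, because the hidden states of prefix tokens never depend on later tokens. If the query came first instead, the visual tokens' keys and values would change with every query and could not be cached per document.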
Problem

Research questions and friction points this paper is trying to address.

multimodal reranking
cache reuse
computation efficiency
visual tokens
cross-segment attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-first
cache reuse
early exit
interaction sparsity
visual token pruning