Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search

📅 2026-05-29

🤖 AI Summary

Existing visual document retrieval systems struggle to balance efficiency and accuracy at scale: VLM-based dense and multi-vector models require neural query encoding at serving time, while OCR- or caption-based BM25 avoids query encoding but depends on time-consuming text extraction or generation. This work proposes V-SPLADE, an inference-free sparse retriever trained with "caption-gated token supervision", a training-only signal that uses captions generated by vision-language models as lexical cues to activate retrieval-relevant vocabulary dimensions. V-SPLADE improves average NDCG@5 across six benchmarks by 13.8 points over a dense baseline of the same scale and by up to 6.3 points over OCR- or caption-based BM25 baselines. On a corpus of 18.7 million documents, it more than doubles R@5 over the same-scale dense baseline, and fusing its scores with those of competing retrievers improves them by up to 2.4 points in R@5.
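
Score fusion can be illustrated with a small sketch. The source does not say which fusion rule V-SPLADE uses, so the min-max normalization, the weight `alpha`, and the names `min_max_normalize` and `fuse_scores` below are assumptions chosen for illustration, not the authors' method.

```python
def min_max_normalize(scores):
    """Rescale one retriever's scores to [0, 1] so two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal; treat every document as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(sparse_scores, other_scores, alpha=0.5):
    """Linearly combine two score lists (hypothetical fusion rule).

    A document missing from one list contributes 0.0 from that list.
    """
    a = min_max_normalize(sparse_scores)
    b = min_max_normalize(other_scores)
    fused = {
        doc_id: alpha * a.get(doc_id, 0.0) + (1.0 - alpha) * b.get(doc_id, 0.0)
        for doc_id in set(a) | set(b)
    }
    # Sort by descending fused score, breaking ties by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up scores from a sparse retriever and a second (e.g. dense) retriever.
sparse_scores = {"page_1": 2.0, "page_2": 1.0, "page_3": 0.0}
dense_scores = {"page_2": 1.0, "page_3": 0.5, "page_4": 0.0}
ranking = fuse_scores(sparse_scores, dense_scores)
```

In this toy example `page_2` ranks first because both retrievers score it reasonably well, even though neither list alone agrees on the full ordering.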

📝 Abstract

As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual documents to serve queries without neural encoding at scale. Existing methods either achieve strong retrieval quality with VLM-based dense or multi-vector models but require neural query encoding at serving time, or avoid query encoding with OCR- or caption-based BM25 at the cost of time-consuming text extraction or generation. To fill this missing serving regime, we present V-SPLADE, an inference-free sparse retriever for visual-document retrieval. However, such inference-free multimodal learned sparse retrieval systems remain underexplored and have not yet shown dense-level effectiveness under high sparsity. We attribute this limitation to a lexical grounding problem: visual sparse representations often fail to capture the lexical content embedded in document images. To address this problem, we introduce caption-gated token supervision, a training-only signal that uses VLM-generated captions as lexical cues to activate retrieval-relevant vocabulary dimensions. With this supervision, V-SPLADE improves average NDCG@5 across six visual-document retrieval benchmarks by +13.8pp over the same-scale dense baseline and by up to +6.3pp over OCR- or caption-based BM25 baselines. On an 18.7M-document corpus, it more than doubles R@5 over the same-scale dense baseline and further improves competing retrievers through score fusion by up to +2.4pp R@5. Code will be released soon at https://github.com/naver/v-splade.
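
To make the "inference-free" serving regime concrete, the following minimal sketch scores a query against precomputed sparse document vectors through an inverted index. It is an illustration under stated assumptions, not the V-SPLADE implementation: the term weights are made up, whitespace tokenization stands in for whatever tokenizer the real system uses, and `build_inverted_index` and `search` are hypothetical names.

```python
from collections import defaultdict


def build_inverted_index(doc_term_weights):
    """Map each vocabulary term to its postings list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, weights in doc_term_weights.items():
        for term, weight in weights.items():
            if weight > 0.0:  # sparse: only non-zero dimensions are stored
                index[term].append((doc_id, weight))
    return index


def search(index, query, k=5):
    """Sum the document-side weights of every term that appears in the query.

    The query is only tokenized (whitespace split, an assumption for this
    sketch); no neural model runs at query time.
    """
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += weight
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical sparse vectors, as an offline document encoder might emit them.
doc_term_weights = {
    "page_1": {"sparse": 1.2, "retrieval": 0.9, "index": 0.4},
    "page_2": {"invoice": 1.5, "total": 0.7},
    "page_3": {"retrieval": 0.5, "benchmark": 1.1},
}
index = build_inverted_index(doc_term_weights)
results = search(index, "sparse retrieval benchmark")
```

All neural computation happens offline when documents are encoded; serving a query reduces to postings-list lookups and additions, which is what lets this kind of retriever run on standard lexical search infrastructure.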

Problem

Research questions and friction points this paper is trying to address.

visual document retrieval

inference-free retrieval

sparse retrieval

lexical grounding

multimodal retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-free retrieval

learned sparse retrieval

visual document search