Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search

📅 2026-05-29

🤖 AI Summary
Existing visual document retrieval systems struggle to balance efficiency and accuracy at scale: VLM-based dense and multi-vector models require neural query encoding at serving time, while OCR- or caption-based BM25 avoids query encoding but relies on time-consuming text extraction or generation and suffers from limited performance. This work proposes V-SPLADE, an inference-free sparse retriever trained with "caption-gated token supervision", a training-only signal that uses captions generated by vision-language models as lexical cues to activate retrieval-relevant vocabulary dimensions. Across six visual-document retrieval benchmarks, V-SPLADE improves average NDCG@5 by 13.8 points over a same-scale dense baseline and by up to 6.3 points over OCR- or caption-based BM25 baselines. On a corpus of 18.7 million documents, it more than doubles R@5 relative to the same-scale dense baseline, and score fusion with V-SPLADE improves competing retrievers by up to 2.4 points of R@5.
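The summary does not say how scores from different retrievers are fused. As an illustration only, the sketch below assumes one common recipe: min-max normalize each retriever's scores, then take a weighted sum. The function names, the `alpha` weight, and the toy scores are all hypothetical and are not taken from the paper.

```python
def min_max(scores):
    """Rescale one retriever's scores to [0, 1] so they are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(sparse_scores, other_scores, alpha=0.5):
    """Weighted sum of normalized scores (assumed fusion rule, not the paper's)."""
    a, b = min_max(sparse_scores), min_max(other_scores)
    docs = set(a) | set(b)
    fused = {d: alpha * a.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
    # Sort by descending fused score; break ties by document id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy scores from a sparse retriever and a second, hypothetical retriever.
sparse = {"d1": 2.5, "d2": 0.9, "d3": 0.1}
other = {"d2": 0.82, "d3": 0.80, "d4": 0.40}
ranked = fuse(sparse, other)
```

A document that only one retriever returns is treated as scoring 0 for the other, which is a design choice of this sketch rather than something the source specifies.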
📝 Abstract
As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual documents to serve queries without neural encoding at scale. Existing methods either achieve strong retrieval quality with VLM-based dense or multi-vector models but require neural query encoding at serving time, or avoid query encoding with OCR- or caption-based BM25 at the cost of time-consuming text extraction or generation. To fill this missing serving regime, we present V-SPLADE, an inference-free sparse retriever for visual-document retrieval. However, such inference-free multimodal learned sparse retrieval systems remain underexplored and have not yet shown dense-level effectiveness under high sparsity. We attribute this limitation to a lexical grounding problem: visual sparse representations often fail to capture the lexical content embedded in document images. To address this problem, we introduce caption-gated token supervision, a training-only signal that uses VLM-generated captions as lexical cues to activate retrieval-relevant vocabulary dimensions. With this supervision, V-SPLADE improves average NDCG@5 across six visual-document retrieval benchmarks by +13.8pp over the same-scale dense baseline and by up to +6.3pp over OCR- or caption-based BM25 baselines. On an 18.7M-document corpus, it more than doubles R@5 over the same-scale dense baseline and further improves competing retrievers through score fusion by up to +2.4pp R@5. Code will be released soon at https://github.com/naver/v-splade.
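To make the "inference-free" serving regime concrete, here is a minimal sketch of query-time scoring when only the document side carries learned weights. It assumes whitespace tokenization and a uniform weight of 1 per query term. The document vectors, function names, and toy terms are hypothetical, and this is not V-SPLADE's actual pipeline.

```python
from collections import defaultdict

# Hypothetical document-side sparse vectors (term -> learned weight).
# In a real system an offline document encoder would produce these.
DOC_VECTORS = {
    "doc_a": {"sparse": 1.4, "retrieval": 1.1, "index": 0.6},
    "doc_b": {"dense": 1.3, "retrieval": 0.9, "encoder": 0.7},
    "doc_c": {"invoice": 1.5, "total": 0.8},
}


def build_inverted_index(doc_vectors):
    """Map each term to the (doc_id, weight) postings that contain it."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def search(query, index, k=5):
    """Score documents by summing stored weights of matched query terms.

    The query side is plain tokenization: no neural model runs at serving time.
    """
    terms = set(query.lower().split())
    scores = defaultdict(float)
    for term in terms:
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


INDEX = build_inverted_index(DOC_VECTORS)
results = search("sparse retrieval", INDEX)
```

All neural computation happens offline when the document vectors are built, so a query costs only a tokenizer pass and a few posting-list lookups.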
Problem

Research questions and friction points this paper is trying to address.

visual document retrieval
inference-free retrieval
sparse retrieval
lexical grounding
multimodal retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-free retrieval
learned sparse retrieval
visual document search
caption-gated supervision
lexical grounding
Gyu-Hwung Cho
NAVER Corp., Gyeonggi-do, Republic of Korea; Seoul National University, Seoul, Republic of Korea
Youngjune Lee
NAVER Corp., Gyeonggi-do, Republic of Korea
Kiyoon Jeong
NAVER Corp., Gyeonggi-do, Republic of Korea
Siyoung Lee
NAVER Corp., Gyeonggi-do, Republic of Korea
Sanggyu Han
NAVER Corp., Gyeonggi-do, Republic of Korea
Hervé Dejean
Naver Labs Europe, Meylan, France
Stéphane Clinchant
Naver Labs Europe
Seung-won Hwang
Seoul National University