EviProp: Seeded Relevance Diffusion on Chunk-Page Graphs for Long Multimodal Document Retrieval

📅 2026-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing page-level visual retrievers for question answering over long multimodal documents typically score each page independently, so they struggle to retrieve evidence that is scattered across fine-grained text chunks or that depends on intra-document structure. This work models each document as a multimodal Chunk-Page graph that encodes hierarchical, sequential, and similarity relationships. Given a query, it combines dense visual page priors with sparse textual chunk seeds and propagates relevance over the graph with Personalized PageRank, fusing local cues with global document structure. On MMLongBench-Doc and LongDocURL, the method shows consistent gains in evidence-page retrieval over independent visual retrieval and text-visual fusion baselines, and the improved retrieval translates into better downstream QA accuracy with negligible online retrieval overhead.
📝 Abstract
Retrieving evidence pages from visually rich long documents is a key challenge in document question answering. Existing page-level visual retrievers operate under an independent matching paradigm: each page is scored in isolation based on query-page similarity. This paradigm can under-rank evidence pages whose signals are localized in fine-grained chunks or depend on document-internal associations. We propose EviProp, a retrieval method that recovers such pages via seeded relevance diffusion. EviProp models each document as a multimodal Chunk-Page graph with hierarchical, sequential, and similarity links. Given a query, it combines dense visual page priors with sparse chunk seeds, then runs Personalized PageRank to diffuse relevance over the graph. Experiments on MMLongBench-Doc and LongDocURL show consistent gains in evidence-page retrieval over independent visual retrieval and text-visual fusion baselines. Downstream QA results further show that improved retrieval translates into better answer accuracy, with negligible online retrieval overhead. Our code is released at https://github.com/Flyecnu/EviProp.
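To make the idea of seeded relevance diffusion concrete, the following is a minimal Python sketch, not the released EviProp implementation. It builds a toy Chunk-Page graph with hierarchical (page-chunk), sequential (adjacent pages), and similarity links, mixes a dense page prior with a sparse chunk seed into a restart distribution, and runs Personalized PageRank by power iteration. The function names, the edge weights, the mixing weight `lam`, the restart probability `alpha`, and the toy scores are all illustrative assumptions; the abstract does not specify them.

```python
from collections import defaultdict


def build_chunk_page_graph(page_chunks, similar_pairs=()):
    """Build an undirected weighted graph (hypothetical helper).

    page_chunks[p] lists the chunk ids found on page p.
    similar_pairs holds (node_u, node_v, weight) similarity links.
    Nodes are tuples such as ("page", 2) or ("chunk", 3).
    """
    adj = defaultdict(dict)

    def link(u, v, w=1.0):
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w

    for p, chunks in enumerate(page_chunks):
        for c in chunks:
            link(("page", p), ("chunk", c))       # hierarchical link
        if p + 1 < len(page_chunks):
            link(("page", p), ("page", p + 1))    # sequential link
    for u, v, w in similar_pairs:
        link(u, v, w)                             # similarity link
    return adj


def make_restart(page_prior, chunk_seeds, lam=0.5):
    """Mix a dense page prior with sparse chunk seeds (assumed scheme)."""
    restart = defaultdict(float)
    page_total = sum(page_prior.values())
    seed_total = sum(chunk_seeds.values())
    for p, s in page_prior.items():
        restart[("page", p)] += lam * s / page_total
    for c, s in chunk_seeds.items():
        restart[("chunk", c)] += (1.0 - lam) * s / seed_total
    return dict(restart)


def personalized_pagerank(adj, restart, alpha=0.15, iters=50):
    """Power iteration: score = alpha * restart + (1 - alpha) * random-walk step."""
    total = sum(restart.get(n, 0.0) for n in adj)
    r = {n: restart.get(n, 0.0) / total for n in adj}
    score = dict(r)
    for _ in range(iters):
        nxt = {n: alpha * r[n] for n in adj}
        for u, nbrs in adj.items():
            out_weight = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += (1.0 - alpha) * score[u] * w / out_weight
        score = nxt
    return score


# Toy document: 4 pages, 6 chunks (all numbers are made up for illustration).
page_chunks = [[0, 1], [2], [3, 4], [5]]
graph = build_chunk_page_graph(
    page_chunks,
    similar_pairs=[(("chunk", 3), ("chunk", 5), 0.5)],
)

# The visual retriever alone prefers page 0; a text chunk on page 2 matches the query.
page_prior = {0: 0.5, 1: 0.2, 2: 0.2, 3: 0.1}
chunk_seeds = {3: 1.0}

scores = personalized_pagerank(graph, make_restart(page_prior, chunk_seeds))
ranked_pages = sorted(
    range(len(page_chunks)), key=lambda p: scores[("page", p)], reverse=True
)
print(ranked_pages)
```

In this toy setup the page prior alone would rank page 0 first, while the diffused scores rank page 2 first because relevance flows from the seeded chunk to its parent page. All graph construction happens before any query arrives; only the restart vector and the short power iteration depend on the query, which is one plausible reading of why the abstract reports negligible online overhead.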
Problem

Research questions and friction points this paper is trying to address.

document retrieval
evidence page
long multimodal document
visual retriever
question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

relevance diffusion
Chunk-Page graph
multimodal document retrieval
Personalized PageRank
evidence retrieval
Authors

Hongwei Zhang
East China Normal University

Xiaoman Wang
East China Normal University

Zehui Ling
Fudan University

Ruicheng Zhu
Shanghai Jiao Tong University

Yue Zhang
Shanghai Artificial Intelligence Laboratory

Pinlong Cai
Shanghai Artificial Intelligence Laboratory
Artificial Intelligence, Decision Intelligence, Knowledge Systems

Fuke Shen
East China Normal University

Botian Shi
Shanghai Artificial Intelligence Laboratory
VLMs, Document Understanding, Autonomous Driving

Tongquan Wei
East China Normal University

Guohang Yan
Shanghai Artificial Intelligence Laboratory