🤖 AI Summary
Existing page-level visual retrievers for long-document multimodal question answering typically score each page in isolation against the query. This independent matching can under-rank evidence pages whose signals are localized in fine-grained text chunks or depend on associations within the document. This work proposes EviProp, which models each document as a multimodal Chunk-Page graph with hierarchical, sequential, and similarity links. Given a query, it combines dense visual page priors with sparse textual chunk seeds and runs Personalized PageRank over the graph, so that local cues are propagated along the document's structure. On MMLongBench-Doc and LongDocURL, the method shows consistent gains in evidence-page retrieval over independent visual retrieval and text-visual fusion baselines, and the improved retrieval carries over to better downstream QA accuracy with negligible online retrieval overhead.
📝 Abstract
Retrieving evidence pages from visually rich long documents is a key challenge in document question answering. Existing page-level visual retrievers operate under an independent matching paradigm: each page is scored in isolation based on query-page similarity. This paradigm can under-rank evidence pages whose signals are localized in fine-grained chunks or depend on document-internal associations. We propose EviProp, a retrieval method that recovers such pages via seeded relevance diffusion. EviProp models each document as a multimodal Chunk-Page graph with hierarchical, sequential, and similarity links. Given a query, it combines dense visual page priors with sparse chunk seeds, then runs Personalized PageRank to diffuse relevance over the graph. Experiments on MMLongBench-Doc and LongDocURL show consistent gains in evidence-page retrieval over independent visual retrieval and text-visual fusion baselines. Downstream QA results further show that improved retrieval translates into better answer accuracy, with negligible online retrieval overhead. Our code is released at https://github.com/Flyecnu/EviProp.
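The seeded diffusion step described above can be illustrated with a minimal sketch. It is not the released EviProp implementation: the toy graph, the edge weights, the mixing weight `lam`, the damping factor `alpha`, and all function and variable names are assumptions chosen for illustration. The sketch builds a tiny Chunk-Page graph with the three link types, mixes a dense page prior with a sparse chunk seed into one restart distribution, and runs Personalized PageRank by power iteration.

```python
from collections import defaultdict


def personalized_pagerank(edges, restart, alpha=0.85, iters=200, tol=1e-12):
    """Power iteration for Personalized PageRank on an undirected weighted graph.

    edges:   iterable of (u, v, weight) tuples
    restart: dict mapping node -> non-negative seed mass (normalized here)
    alpha:   probability of following an edge; 1 - alpha teleports to the seeds
    """
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w

    nodes = sorted(set(adj) | set(restart))
    total = sum(restart.values())
    if total <= 0:
        raise ValueError("restart distribution needs positive mass")
    r = {n: restart.get(n, 0.0) / total for n in nodes}

    rank = dict(r)
    for _ in range(iters):
        new = {n: (1.0 - alpha) * r[n] for n in nodes}
        for u in nodes:
            nbrs = adj.get(u, {})
            out = sum(nbrs.values())
            if out == 0:
                # Node without edges: return its mass to the seeds.
                for n in nodes:
                    new[n] += alpha * rank[u] * r[n]
                continue
            for v, w in nbrs.items():
                new[v] += alpha * rank[u] * w / out
        delta = sum(abs(new[n] - rank[n]) for n in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical toy document: three pages (p*) with one text chunk each (c*).
edges = [
    ("c1", "p1", 1.0), ("c2", "p2", 1.0), ("c3", "p3", 1.0),  # hierarchical
    ("p1", "p2", 0.5), ("p2", "p3", 0.5),                      # sequential
    ("c1", "c3", 0.5),                                          # similarity
]

# Assumed inputs: a dense visual prior over all pages, a sparse chunk seed.
page_prior = {"p1": 0.6, "p2": 0.3, "p3": 0.1}
chunk_seeds = {"c3": 1.0}

# Assumed combination rule: a convex mix of the two seed distributions.
lam = 0.5
restart = {n: lam * s for n, s in page_prior.items()}
for n, s in chunk_seeds.items():
    restart[n] = restart.get(n, 0.0) + (1.0 - lam) * s

scores = personalized_pagerank(edges, restart)
ranked_pages = sorted(page_prior, key=lambda p: scores[p], reverse=True)
print(ranked_pages)
```

In this sketch, chunk `c1` receives no seed mass of its own, yet it ends up with a positive score because relevance flows in from its parent page and from the similar chunk `c3`. That is the behavior independent page scoring cannot provide. How the actual method weights edges and combines the two seed sources is not specified in the abstract.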