Constrained Dominant Sets for Multimodal Document Question Answering

📅 2026-06-05
🤖 AI Summary
Existing retrievers for long multimodal document question answering spend their budget on redundant evidence and overlook complementary information, so critical evidence never reaches the reader. The authors propose a training-free, graph-based retriever that builds a query-augmented affinity graph in which the query acts as a hard structural constraint. Evidence is selected as a Constrained Dominant Set (CDS): a spectral bound sets the relevance-redundancy balance automatically, and replicator dynamics drive the selection to a global equilibrium instead of relying on greedy heuristics. With a Qwen3-VL-32B reader, the method reaches a state-of-the-art average score of 66.99 on VisDoMBench and improves over the no-retrieval baseline by 37.1 points on VisDoMBench and 4.8 points on MMLongBench-Doc.
📝 Abstract
Long multimodal document question answering is limited by which evidence reaches the reader, rather than by the quantity retrieved. In lengthy documents, findings often recur across figures, captions, and introductory sentences, causing similarity-based retrievers in modern multimodal retrieval-augmented generation (RAG) systems to allocate resources to near-duplicates while overlooking complementary evidence. This work introduces a retriever that selects evidence as a Constrained Dominant Set (CDS) on a query-augmented affinity graph, offering three advantages that similarity ranking does not. First, the query is encoded as a hard structural constraint, ensuring that every selected element is directly connected to the question through the cluster anchor. Second, the relevance-redundancy balance is determined automatically by a spectral bound, eliminating the need for the manually tuned trade-offs required by diversity-aware selectors. Third, the selection process reaches a global equilibrium via replicator dynamics, thereby avoiding the distortions introduced by greedy heuristics. The method is inherently graph-based and does not require training. Using a Qwen3-VL-32B reader, CDS establishes a new state of the art on VisDoMBench ($66.99$ average) and improves over the no-retrieval baseline by $37.1$ points on VisDoMBench and $4.8$ on MMLongBench-Doc.
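The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how affinities are computed or how the support is read out, so the sketch assumes the standard constrained dominant set program from the graph-clustering literature, maximizing $x^\top (A - \alpha \hat{I}) x$ over the simplex, where $\hat{I}$ is a diagonal matrix with ones on every node except the query and $\alpha$ exceeds the largest eigenvalue of the affinity submatrix over the non-query nodes. The function name `cds_select`, the clipped cosine affinity, and the `support_eps` threshold are hypothetical choices made for illustration.

```python
import numpy as np

def cds_select(embeddings, query, n_iter=1000, tol=1e-8, support_eps=1e-5):
    """Hypothetical helper: pick evidence items as a constrained dominant set.

    Assumptions (not taken from the paper): cosine affinities clipped at zero,
    the query as the single constraint node, and a fixed support threshold.
    """
    # Node 0 is the query; nodes 1..n are the candidate evidence items.
    X = np.vstack([query, embeddings]).astype(float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    A = np.clip(X @ X.T, 0.0, None)   # nonnegative symmetric affinities
    np.fill_diagonal(A, 0.0)          # no self-loops
    n = A.shape[0]

    # Spectral bound: alpha just above the largest eigenvalue of the
    # affinity submatrix restricted to the unconstrained (non-query) nodes.
    alpha = np.linalg.eigvalsh(A[1:, 1:]).max() + 1e-3

    # Penalize every node except the query on the diagonal.
    penalty = np.full(n, alpha)
    penalty[0] = 0.0
    # Adding the constant alpha to every entry shifts the objective by a
    # constant on the simplex, so maximizers are unchanged, and it makes
    # the payoff matrix nonnegative as replicator dynamics require.
    B = A - np.diag(penalty) + alpha

    # Discrete-time replicator dynamics started from the simplex barycenter.
    x = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        Bx = B @ x
        x_new = x * Bx / (x @ Bx)
        converged = np.abs(x_new - x).sum() < tol
        x = x_new
        if converged:
            break

    # Evidence items whose weight stays above the threshold form the selection.
    support = np.flatnonzero(x[1:] > support_eps)
    return support, x

# Toy usage: two items aligned with the query and one orthogonal to it.
query = np.array([1.0, 0.0])
items = np.array([[1.0, 0.10], [1.0, 0.12], [0.0, 1.0]])
selected, weights = cds_select(items, query)
print(selected, weights.round(3))
```

In this formulation no relevance-versus-diversity weight is tuned by hand: `alpha` is derived from the graph itself, which is the role the abstract assigns to the spectral bound. The toy data only exercises the mechanics and says nothing about the reported benchmark numbers.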
Problem

Research questions and friction points this paper is trying to address.

multimodal document question answering
evidence retrieval
redundancy
complementary evidence
long document understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constrained Dominant Set
multimodal retrieval-augmented generation
affinity graph
replicator dynamics
evidence selection