FLOWREADER: Min-Cost Flow Optimization for Multi-Modal Long Document Q&A

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of fragmented evidence in multimodal long-document question answering, where conventional top-k retrieval cannot model how evidence connects across text, tables, and slides. The paper casts evidence assembly as a minimum-cost flow optimization problem over a multimodal node graph. A single scoring vector jointly governs source and sink selection, edge costs, and capacities. Sources are selected with MMR and sinks with a length-aware answerability proxy; the resulting flow is decomposed into evidence paths, a compact subset is chosen by entropy-regularized replicator dynamics, and a dual-process gating mechanism controls adaptive computation. This unifies retrieval, routing, selection, and adaptive computation in one framework. On VisDoMBench, the method achieves the best results on the PaperTab (58.40) and SlideVQA (72.93) subsets and is competitive on the remaining three. Its macro-average score of 65.47 is within 0.74 of the strongest baseline, G²-Reader (66.21).
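
As a rough illustration of the MMR-based source selection mentioned above, the sketch below uses the standard greedy Maximal Marginal Relevance rule (relevance traded against similarity to already selected items). The node names, relevance scores, embeddings, and the `lam` trade-off value are hypothetical and are not taken from the paper; how FLOWREADER derives these quantities from its scoring vector is not specified here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(relevance, embeddings, k, lam=0.7):
    """Greedy MMR: pick up to k node ids, trading relevance against redundancy.

    relevance:  dict node_id -> relevance score (assumed given)
    embeddings: dict node_id -> vector (assumed given)
    lam:        1.0 means pure relevance, 0.0 means pure diversity
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(node):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(embeddings[node], embeddings[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[node] - (1 - lam) * redundancy
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data (hypothetical): two near-duplicate text chunks and one table node.
relevance = {"text_1": 0.90, "text_2": 0.85, "table_1": 0.60}
embeddings = {"text_1": [1.0, 0.0], "text_2": [1.0, 0.05], "table_1": [0.0, 1.0]}

diverse_sources = mmr_select(relevance, embeddings, k=2, lam=0.5)
greedy_sources = mmr_select(relevance, embeddings, k=2, lam=1.0)
```

With `lam=0.5` the near-duplicate `text_2` is skipped in favor of the less relevant but non-redundant `table_1`, whereas `lam=1.0` reduces to plain top-k by relevance.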

📝 Abstract

Long, multimodal documents force retrieval-augmented systems to assemble answers from evidence fragmented across text, tables, and slides: broken across cells in a long table, spread over multiple slides, or split between a figure and its discussion. Top-$k$ chunk retrieval treats each fragment independently and cannot represent how evidence connects. We introduce FLOWREADER, which reframes evidence assembly as a min-cost flow problem on a multimodal node graph: a single scoring vector $h$ controls source selection (via MMR), sink selection (via a length-aware answerability proxy), and the costs and capacities of every edge. The optimal flow is decomposed into candidate evidence paths, a compact non-redundant subset is selected by entropy-regularized replicator dynamics, and parallel VLM workers under a dual-process gate produce the answer, with a single System-2 refinement pass triggered when answer consistency is low or the routed flow is strained. On VisDoMBench, FLOWREADER is best on the two subsets dominated by fragmented evidence, PaperTab ($58.40$, $+1.30$ over $G^{2}$-Reader) and SlideVQA ($72.93$, $+0.62$), and competitive on SPIQA, FetaTab, and SciGraphQA. Macro-averaged across all five subsets, FLOWREADER ($65.47$) is within $0.74$ of the strongest baseline ($G^{2}$-Reader, $66.21$). Overall, these results show that min-cost flow performs well on fragmented multimodal evidence, where top-$k$ retrieval fails, and that it provides a unified way to control scoring, routing, selection, and adaptive compute.
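
To make the flow formulation more concrete, here is a minimal sketch of min-cost flow followed by path decomposition on a toy evidence graph. It uses a generic successive-shortest-paths solver with Bellman-Ford, which is a textbook technique and not necessarily the solver used in the paper. The graph, node roles, capacities, and integer costs are hypothetical; in FLOWREADER they would be derived from the scoring vector $h$, and that mapping is not reproduced here. The decomposition step assumes the graph is acyclic.

```python
INF = float("inf")

def min_cost_flow(n, edges, source, sink, demand):
    """Successive shortest paths. edges: list of (u, v, capacity, cost).

    Returns (flow_sent, total_cost, flow_on_each_input_edge).
    """
    graph = [[] for _ in range(n)]
    to, cap, cost = [], [], []
    for u, v, c, w in edges:
        # Forward edge at an even index, its residual reverse at index ^ 1.
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)
    sent, total = 0, 0
    while sent < demand:
        dist = [INF] * n
        prev_edge = [-1] * n
        dist[source] = 0
        for _ in range(n - 1):  # Bellman-Ford handles negative residual costs
            changed = False
            for u in range(n):
                if dist[u] == INF:
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                        dist[to[e]] = dist[u] + cost[e]
                        prev_edge[to[e]] = e
                        changed = True
            if not changed:
                break
        if dist[sink] == INF:
            break  # no augmenting path left
        # Find the bottleneck along the shortest path, then push flow.
        push, v = demand - sent, sink
        while v != source:
            e = prev_edge[v]
            push = min(push, cap[e])
            v = to[e ^ 1]
        v = sink
        while v != source:
            e = prev_edge[v]
            cap[e] -= push
            cap[e ^ 1] += push
            v = to[e ^ 1]
        sent += push
        total += push * dist[sink]
    # Residual capacity of each reverse edge equals the flow on the forward edge.
    flows = [cap[2 * i + 1] for i in range(len(edges))]
    return sent, total, flows

def decompose_paths(edges, flows, source, sink):
    """Split an acyclic flow into (node_path, amount) pairs."""
    remaining = list(flows)
    paths = []
    while True:
        path, used, v = [source], [], source
        while v != sink:
            nxt = next((i for i, (u, _, _, _) in enumerate(edges)
                        if u == v and remaining[i] > 0), None)
            if nxt is None:
                break
            used.append(nxt)
            v = edges[nxt][1]
            path.append(v)
        if v != sink:
            return paths
        amount = min(remaining[i] for i in used)
        for i in used:
            remaining[i] -= amount
        paths.append((path, amount))

# Toy graph (hypothetical): 0 = source, 1 = text chunk,
# 2 and 3 = two table rows, 4 = sink.
edges = [(0, 1, 2, 0), (1, 2, 1, 1), (1, 3, 1, 3), (2, 4, 1, 1), (3, 4, 1, 1)]
sent, total, flows = min_cost_flow(5, edges, source=0, sink=4, demand=2)
paths = decompose_paths(edges, flows, source=0, sink=4)
```

Here two units of flow are routed through the text chunk and split across the two table rows, yielding two candidate evidence paths. Unlike independent top-$k$ scoring, each path records which nodes are connected, which is the property the abstract emphasizes.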

Problem

Research questions and friction points this paper is trying to address.

multi-modal

long document

fragmented evidence

question answering

retrieval-augmented systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

min-cost flow

multimodal retrieval

evidence assembly