🤖 AI Summary
To address context-length limitations and scarce annotated data in long-document Visual Question Answering (DocVQA) for low-resource settings, this paper proposes a unified adaptive framework. It integrates sparse-dense hybrid text retrieval for efficient key-paragraph localization; employs a multi-level verification mechanism for high-quality, automatic question-answer generation, enabling robust data augmentation; and introduces adaptive ensemble inference with dynamic configuration generation and early-stopping strategies to improve robustness and generalization. On the JDocQA benchmark, the framework achieves 83.04% accuracy on yes/no questions, 52.66% on factual questions, and 44.12% on numerical questions, surpassing prior methods. On the LAVA dataset, it attains 59.0% accuracy, establishing a new state-of-the-art for Japanese DocVQA.
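The sparse-dense hybrid retrieval described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes BM25 as the sparse scorer and uses a plain bag-of-words cosine as a stand-in for a real dense embedding model, fusing the two with a hypothetical weight `alpha`.

```python
# Illustrative sketch of sparse-dense hybrid retrieval (assumed form, not
# the paper's exact method): BM25 sparse scores + a toy "dense" cosine
# score, min-max normalized and fused with weight alpha.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the tokenized query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_retrieve(query, docs, alpha=0.5, top_k=1):
    """Fuse normalized sparse and dense scores; the 'dense' vector here is
    a bag-of-words proxy for an actual embedding model."""
    sparse = bm25_scores(query, docs)
    qv = Counter(query)
    dense = [cosine(qv, Counter(d)) for d in docs]
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    fused = [alpha * s + (1 - alpha) * c
             for s, c in zip(norm(sparse), norm(dense))]
    ranked = sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
    return ranked[:top_k]
```

In practice the dense scorer would be a neural sentence encoder; the fusion weight and normalization scheme are design choices the paper's retriever would tune.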
📝 Abstract
Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments due to context limitations and insufficient training data. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations: a hybrid text retrieval architecture for effective document segmentation, an intelligent data augmentation pipeline that automatically generates high-quality reasoning question-answer pairs with multi-level verification, and adaptive ensemble inference with dynamic configuration generation and early stopping mechanisms. Experiments on Japanese document VQA benchmarks demonstrate substantial improvements, with 83.04% accuracy on Yes/No questions, 52.66% on factual questions, and 44.12% on numerical questions on JDocQA, and 59% accuracy on the LAVA dataset. Ablation studies confirm meaningful contributions from each component, and our framework establishes new state-of-the-art results for Japanese document VQA while providing a scalable foundation for other low-resource languages and specialized domains. Our code is available at: https://github.com/Haoxuanli-Thu/AdaDocVQA.
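The adaptive ensemble inference with early stopping can be sketched as follows. This is a hedged illustration under assumed semantics, not the paper's algorithm: inference configurations are tried in sequence, answers are tallied, and inference halts once one answer reaches a vote threshold, avoiding the cost of running every configuration.

```python
# Hedged sketch of ensemble inference with early stopping (assumed form):
# run model configurations one by one, tally their answers, and stop as
# soon as a candidate answer reaches the vote threshold.
from collections import Counter

def ensemble_with_early_stop(run_config, configs, threshold=3):
    """run_config: callable mapping one inference configuration to an
    answer string (hypothetical interface). Returns the first answer to
    reach `threshold` votes, else the plurality answer."""
    votes = Counter()
    for cfg in configs:
        votes[run_config(cfg)] += 1
        answer, count = votes.most_common(1)[0]
        if count >= threshold:
            return answer  # early stop: consensus reached
    return votes.most_common(1)[0][0]  # fall back to plurality vote
```

A real system would generate `configs` dynamically (e.g. varying prompts or decoding parameters) rather than taking a fixed list, which is the "dynamic configuration generation" the abstract refers to.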