AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address context-length limitations and scarce annotated data in low-resource long-document Document Visual Question Answering (DocVQA), this paper proposes a unified adaptive framework. It integrates sparse-dense hybrid text retrieval for efficient key-paragraph localization; employs a multi-level verification mechanism for high-quality, automatic question-answer generation to enable robust data augmentation; and introduces adaptive ensemble inference with dynamic configuration generation and early-stopping strategies to enhance model robustness and generalization. Evaluated on the JDocQA benchmark, the framework achieves 83.04% accuracy on yes/no questions, 52.66% on factual questions, and 44.12% on numerical questions—surpassing prior methods. On the LAVA dataset, it attains 59.0%, establishing a new state-of-the-art for Japanese DocVQA.
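The sparse-dense hybrid retrieval described above can be illustrated with a minimal sketch. The paper's actual implementation is not shown on this page, so everything here is an assumption: the BM25-style lexical scorer, the toy hashed bag-of-words "embedding" (a stand-in for a real neural encoder), the min-max score normalization, and the `alpha` fusion weight are all hypothetical choices for illustration.

```python
import math
from collections import Counter

def sparse_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """BM25-style lexical score of one paragraph against a query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_tokens)
    score = 0.0
    for t in set(query_tokens):
        df = sum(1 for d in corpus if t in d)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = tf[t]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

def embed(tokens, dim=64):
    """Toy deterministic hashed bag-of-words vector; a real system
    would use a neural sentence encoder here."""
    v = [0.0] * dim
    for t in tokens:
        v[sum(ord(c) * 31 ** i for i, c in enumerate(t)) % dim] += 1.0
    return v

def dense_score(q_vec, d_vec):
    """Cosine similarity between embedding vectors."""
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    nq = math.sqrt(sum(a * a for a in q_vec))
    nd = math.sqrt(sum(b * b for b in d_vec))
    return dot / (nq * nd) if nq and nd else 0.0

def hybrid_retrieve(query, paragraphs, alpha=0.5, top_k=2):
    """Fuse min-max-normalized sparse and dense scores with weight alpha,
    then return the top_k paragraphs for downstream answering."""
    corpus = [p.split() for p in paragraphs]
    q = query.split()
    sp = [sparse_score(q, d, corpus) for d in corpus]
    de = [dense_score(embed(q), embed(d)) for d in corpus]
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    sp, de = norm(sp), norm(de)
    fused = [alpha * s + (1 - alpha) * d for s, d in zip(sp, de)]
    ranked = sorted(range(len(paragraphs)), key=lambda i: fused[i], reverse=True)
    return [paragraphs[i] for i in ranked[:top_k]]
```

Feeding only the top-ranked paragraphs to the answering model is what keeps long documents within the context window, which is the motivation the summary gives for this retrieval stage.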

📝 Abstract
Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments due to context limitations and insufficient training data. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations: a hybrid text retrieval architecture for effective document segmentation, an intelligent data augmentation pipeline that automatically generates high-quality reasoning question-answer pairs with multi-level verification, and adaptive ensemble inference with dynamic configuration generation and early stopping mechanisms. Experiments on Japanese document VQA benchmarks demonstrate substantial improvements, with 83.04% accuracy on Yes/No questions, 52.66% on factual questions, and 44.12% on numerical questions in JDocQA, and 59% accuracy on the LAVA dataset. Ablation studies confirm meaningful contributions from each component, and our framework establishes new state-of-the-art results for Japanese document VQA while providing a scalable foundation for other low-resource languages and specialized domains. Our code is available at: https://github.com/Haoxuanli-Thu/AdaDocVQA.
Problem

Research questions and friction points this paper is trying to address.

Adaptive framework for long document VQA in low-resource settings
Addresses context limitations and insufficient training data challenges
Improves performance on Japanese document VQA benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid text retrieval architecture for document segmentation
Intelligent data augmentation pipeline with multi-level verification
Adaptive ensemble inference with dynamic configuration mechanisms
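The third innovation, adaptive ensemble inference with early stopping, can be sketched as a voting loop over successive inference configurations. This is a hypothetical reading of the mechanism, not the paper's code: `infer`, the `agree_threshold`, and the `min_votes` floor are all assumed names and parameters, and the configuration list stands in for whatever dynamic configuration generation the framework actually performs.

```python
from collections import Counter

def adaptive_ensemble(infer, configs, agree_threshold=0.6, min_votes=3):
    """Run inference under successive configurations, tallying answers as
    votes. Once at least min_votes runs are in and the leading answer's
    vote share reaches agree_threshold, stop early and return it along
    with the number of configurations actually used."""
    votes = Counter()
    total = 0
    for cfg in configs:
        votes[infer(cfg)] += 1
        total += 1
        if total >= min_votes:
            answer, count = votes.most_common(1)[0]
            if count / total >= agree_threshold:
                return answer, total
    # No early consensus: fall back to a plain majority vote.
    return votes.most_common(1)[0][0], total
```

The early-stopping check is what makes the ensemble adaptive: easy questions where runs agree quickly consume only a few inference calls, while ambiguous ones exhaust the full configuration budget.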
Haoxuan Li
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, Guangdong, China
Wei Song
School of Automation, Guangdong University of Technology, Guangzhou, Guangdong, China
Aofan Liu
School of Information Engineering, Peking University, Shenzhen, Guangdong, China
Peiwu Qin
Tsinghua Shenzhen International Graduate School
Image Processing, TCM