INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

Date: 2026-04-13
Citations: 0
Influential: 0

AI Summary
This work addresses the absence of cross-lingual visual question answering (VQA) benchmarks for document tables in low-resource languages such as Indonesian. It introduces the first multilingual table VQA dataset based on real-world Indonesian documents, comprising 1,593 images paired with monolingual and cross-lingual question-answer annotations. Using open-source vision-language models (Qwen2.5-VL, Gemma-3, and LLaMA-3.2), the study systematically evaluates models and improves their performance through LoRA fine-tuning and by explicitly supplying table coordinates as input. Fine-tuning improves accuracy by 11.6% and 17.8% for the 3B and 7B models, respectively, and adding spatial priors yields a further gain of 4-7%. The results underscore the limitations of current models on complex tabular structures and in low-resource language settings.
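
As a rough illustration of what "supplying table coordinates as input" could look like, the sketch below appends table bounding boxes to the question text before it is sent to a vision-language model. The `TableRegion` class, the `build_prompt` function, the pixel-coordinate convention, and the prompt wording are all assumptions made for this example; the summary does not state how the paper encodes the coordinates.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TableRegion:
    """Bounding box of one table, in pixels (hypothetical convention)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def build_prompt(question: str, regions: list[TableRegion]) -> str:
    """Prefix the question with table bounding boxes as a textual spatial prior.

    This prompt format is an assumption for illustration, not the paper's.
    """
    if not regions:
        # No region information available: fall back to the plain question.
        return question
    box_lines = [
        f"Table {i}: [{r.x_min}, {r.y_min}, {r.x_max}, {r.y_max}]"
        for i, r in enumerate(regions, start=1)
    ]
    header = "Table regions (x_min, y_min, x_max, y_max):"
    return header + "\n" + "\n".join(box_lines) + "\n\nQuestion: " + question
```

The resulting string would be passed to the model together with the document image; whether a given model benefits from this particular text encoding is something to verify empirically.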

๐Ÿ“ Abstract
We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful) with one or more than one tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B and LoRA-finetuned 7B model on our dataset yields 11.6% and 17.8% improvements in accuracy. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of Spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially in underrepresented regions of the world. Full dataset can be accessed in huggingface at: https://huggingface.co/datasets/NusaBharat/INDOTABVQA}
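
The monolingual versus cross-lingual split described in the abstract can be made concrete with a small scoring sketch. It groups predictions by whether the question language matches the document language and reports accuracy per setting. The record fields (`question_lang`, `prediction`, `answer`), the language code `"id"`, and the use of normalized exact match are assumptions for illustration; the abstract does not specify the paper's scoring procedure.

```python
from collections import defaultdict

# All documents in the benchmark are in Bahasa Indonesia; "id" is an assumed code.
DOC_LANGUAGE = "id"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.strip().lower().split())


def accuracy_by_setting(records):
    """Return exact-match accuracy for the monolingual and cross-lingual settings.

    `records` is an iterable of dicts with hypothetical keys
    "question_lang", "prediction", and "answer".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        if rec["question_lang"] == DOC_LANGUAGE:
            setting = "monolingual"
        else:
            setting = "cross-lingual"
        total[setting] += 1
        if normalize(rec["prediction"]) == normalize(rec["answer"]):
            correct[setting] += 1
    # Only settings that actually occur in `records` appear in the result.
    return {setting: correct[setting] / total[setting] for setting in total}
```

Reporting the two settings separately keeps a strong monolingual score from masking weaker results on English, Hindi, or Arabic questions.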
Problem

Research questions and friction points this paper is trying to address.

Cross-lingual Table Understanding
Table Visual Question Answering
Vision-Language Models
Low-resource Languages
Document Image Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual Table VQA
Vision-Language Models
Spatial Priors
Low-resource Languages
Document Understanding