HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the scarcity of non-English benchmarks for chart and table visual question answering (VQA), which leaves it unclear whether the progress of vision-language models on English benchmarks carries over to real-world documents in other languages. To bridge this gap, the authors present HakushoBench, a Japanese chart and table VQA benchmark built from 33 Japanese governmental white papers. The dataset contains 2,053 images spanning over 10 image types, with manually annotated question-answer pairs. Experiments across a broad range of models show that the best open-weight model achieves only 58.6% accuracy and that a 34.9-point gap separates open-weight from proprietary models, indicating substantial room for improvement in complex chart and table understanding.

📝 Abstract

Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. While English benchmarks have advanced rapidly, non-English counterparts remain scarce, leaving it unclear whether this progress generalizes across languages. A key obstacle is the difficulty of collecting realistic and diverse non-English chart and table images at scale. To address this, we leverage governmental white papers as a scalable source for benchmark construction beyond English, as they contain naturally occurring charts and tables across diverse formats and domains and are freely accessible in many countries. As a first instantiation, we introduce HakushoBench, a challenging Japanese chart and table VQA benchmark built from 33 governmental white papers. HakushoBench contains 2,053 images spanning over 10 image types, with manually annotated QA pairs, designed to assess deep and holistic understanding of charts and tables, rather than local visual cues alone. Experiments across a broad range of VLMs demonstrate that HakushoBench remains challenging for open-weight models: the best open-weight model achieves only 58.6% accuracy, and a 34.9-point gap between open-weight and proprietary models highlights substantial room for improvement in complex chart and table understanding. We release our dataset and code.
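
The accuracy figures and the point gap quoted above are simple to compute once predictions and reference answers are available. The following is a minimal sketch only: the abstract does not describe the benchmark's scoring protocol, so exact match after Unicode NFKC normalization is an assumption made here, and the function names `normalize`, `accuracy`, and `gap_in_points` are hypothetical rather than taken from the released code.

```python
import unicodedata


def normalize(answer: str) -> str:
    """NFKC-normalize, trim, and lowercase an answer string.

    NFKC folds full-width digits and symbols (common in Japanese text)
    into their ASCII forms. Assumption: the benchmark's real matching
    rule may differ from this.
    """
    return unicodedata.normalize("NFKC", answer).strip().lower()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Return exact-match accuracy as a percentage in [0, 100]."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


def gap_in_points(score_a: float, score_b: float) -> float:
    """Difference between two accuracy percentages, in percentage points."""
    return round(score_a - score_b, 1)


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from HakushoBench.
    preds = ["１２", "東京", "減少"]
    refs = ["12", "東京", "増加"]
    print(f"accuracy: {accuracy(preds, refs):.1f}%")
```

A gap such as the one reported between open-weight and proprietary models is a difference in percentage points between two such scores, not a relative change, which is what `gap_in_points` computes.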

Problem

Research questions and friction points this paper is trying to address.

chart understanding

table understanding

visual question answering

non-English benchmark

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models

Chart and Table Understanding

Japanese VQA Benchmark