🤖 AI Summary
This work investigates the numerical reasoning and cross-lingual comprehension capabilities of Large Vision-Language Models (LVLMs) on semi-structured table images—specifically cricket scorecards. To address the lack of dedicated evaluation resources, we introduce MMCRICBENCH-3K, the first visual question-answering benchmark for this task, comprising 1,463 synthetically generated scorecard images paired with 3,000 English QA questions (1.5K over English scorecards and 1.5K over visually similar Hindi scorecards), enabling a controlled English–Hindi cross-script experimental design. The benchmark synthesizes scorecards across multiple match formats and requires structure-aware parsing, multi-image contextual reasoning, and implicit domain-knowledge inference. Empirical evaluation reveals that even state-of-the-art LVLMs—including GPT-4o and Qwen2.5-VL—perform poorly on the English subset and degrade substantially further on the Hindi subset. These results systematically expose fundamental deficiencies in current LVLMs across three dimensions: structural parsing, numerical computation, and cross-lingual generalization.
📝 Abstract
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi scorecards, with all questions and answers kept in English to enable controlled cross-script evaluation. The task demands reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Empirical results show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5-VL, struggle on the English subset despite English being their primary training language, and exhibit a further drop in performance on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available on Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench to promote further LVLM research in this direction.