BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multilingual LLM evaluation benchmarks suffer from limited coverage, particularly for pretraining data contamination analysis and for questions that depend on visual content. To address this, the authors reconstruct and extend the BLUEX multilingual evaluation benchmark: they incorporate standardized examination questions from 2024–2025 and integrate state-of-the-art image captioning models, expanding the benchmark to 1,422 usable questions, more than double the original count. Captioning makes over 40% more questions accessible to text-only models, enabling a systematic assessment of how purely text-based LLMs leverage visual context conveyed through captions. The enhanced benchmark increases sensitivity to data contamination and establishes a high-coverage, structured, cross-lingual platform for evaluation.


📝 Abstract
With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the number in the original BLUEX. We evaluated commercial and open-source LLMs and their ability to leverage visual context through captions.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multilingual evaluation methods for LLMs
Expanding benchmark coverage with automatic image captioning
Assessing LLMs' ability to utilize visual context through captions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic image captioning using state-of-the-art models
Enhanced benchmark coverage with 2024-2025 exam data
Increased accessibility for text-only models by more than 40%
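The captioning strategy can be illustrated with a minimal sketch. The function name, placeholder format, and data below are hypothetical, not the paper's actual pipeline: the idea is simply that each image reference in a question is replaced by its generated caption so a text-only model can attempt the question.

```python
def inline_captions(question_text, captions, placeholder="[IMAGE-{i}]"):
    """Replace numbered image placeholders with generated captions so a
    text-only LLM can attempt a visually dependent question.

    captions[i] is assumed to describe the (i+1)-th image in the question.
    """
    for i, caption in enumerate(captions, start=1):
        question_text = question_text.replace(
            placeholder.format(i=i), f"[Image description: {caption}]"
        )
    return question_text

# Hypothetical exam question with one figure reference.
question = "Observe [IMAGE-1]. Which process does the diagram represent?"
captions = ["A cycle diagram showing evaporation, condensation, and precipitation."]
print(inline_captions(question, captions))
```

In the paper's setting the captions themselves come from state-of-the-art captioning models; this sketch only shows the final substitution step that makes the question usable in a text-only evaluation.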
João Guilherme Alves Santos
Instituto de Computação (IC) – Universidade Estadual de Campinas (UNICAMP)
Giovana Kerche Bonás
Unknown affiliation
Thales Sales Almeida
Student, Unicamp
Information retrieval · Machine learning · Deep learning · Generative models