METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Text recognition models are mostly evaluated on modern, printed English text, which says little about how they perform on real-world documents in many languages and layouts. To address this limitation, this work proposes METATR, an evolving, multilingual automatic text recognition benchmark that covers 29 languages, multiple scripts, and varied layouts, and pairs the dataset with a practitioner-oriented dynamic evaluation framework. METATR draws its documents from public collections, defines a standardized prompting and text normalization methodology, and reports results along several dimensions, including dataset-level and language-level performance, robustness to handwritten documents, and computational efficiency, so that open-source and closed-source vision-language models can be compared on the same footing. The experiments show that closed-source models are the most consistent, yet their performance still varies substantially across scripts and layouts, a gap that METATR is designed to expose.

📝 Abstract

Benchmarks that reflect the diversity and complexity of real-world documents are essential for accurately evaluating Automatic Text Recognition (ATR) systems, especially Vision-Large Language Models (vLLMs). Although recent models demonstrate impressive performance, they are often evaluated on datasets containing modern, printed texts mostly written in English, which limits their relevance to many practical applications. Therefore, selecting a model for a specific use case requires evaluating it on data that matches the target documents. This highlights the importance of representative benchmarks for real-world applications. In this paper, we introduce METATR (v1.0), a multilingual, evolving benchmark designed to evaluate ATR models across a wide range of documents, facilitating meaningful model comparison and selection. The benchmark was designed to maximize diversity by including documents from various public collections. These documents cover 29 languages and include texts with multiple scripts and layouts. Beyond the dataset itself, METATR defines a standardized prompting and normalization methodology and establishes a dynamic evaluation framework. This approach is intended to produce reproducible results while remaining extensible over time. We evaluated a wide range of state-of-the-art systems, including open-source models and closed-source models. Results are reported across various dimensions, including performance at the dataset and language levels, robustness to handwritten documents, and computational efficiency. Our findings show that, although proprietary models achieve the most consistent performance, substantial variability persists across scripts and layouts. Overall, METATR provides a multidimensional, practitioner-oriented framework for assessing multilingual ATR in real-world conditions and tracking progress as the field evolves.
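
To make the role of normalization concrete, here is a minimal sketch of how a character error rate (CER) could be computed after normalizing both the reference and the model output. It is an illustration only: the abstract does not state which metric or which normalization rules METATR uses, so the function names (`normalize`, `edit_distance`, `cer`) and the specific rules (Unicode NFC, whitespace collapsing) are assumptions made for this example.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalization: Unicode NFC, then collapse runs of whitespace.

    These rules are assumptions for illustration, not METATR's actual protocol.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        # Convention chosen for this sketch: empty reference scores 0 only
        # if the hypothesis is also empty.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

Applying the same `normalize` step to every model's output before scoring is what keeps differences in spacing or Unicode composition from being counted as recognition errors, which matters when comparing systems across scripts.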

Problem

Research questions and friction points this paper is trying to address.

automatic text recognition

multilingual benchmark

real-world documents

script diversity

evaluation framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual benchmark

automatic text recognition

evolving evaluation framework