🤖 AI Summary
This work addresses the challenge that current vision-language models (VLMs) struggle to accurately parse the complex layout structures, dense hierarchical references, and marginal annotations characteristic of ancient Greek critical editions. To tackle this, the authors construct two resources: the first large-scale dataset of its kind, comprising 185,000 page images synthesized from TEI/XML sources, and a benchmark of real scanned critical editions spanning more than a century of editorial and typographic practice. Using these datasets, they systematically evaluate the structure-aware text recognition capabilities of state-of-the-art VLMs, in particular Qwen3VL-8B, under both zero-shot and fine-tuned settings. Experimental results show that fine-tuned Qwen3VL-8B achieves a median character error rate of 1.0% on real scans, substantially outperforming conventional OCR tools and demonstrating the potential of VLMs for understanding highly structured historical scholarly documents.
📝 Abstract
Recent advances in vision-language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which feature dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practice. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents: in zero-shot settings, most models significantly underperform established off-the-shelf OCR software. Nevertheless, after fine-tuning, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
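The headline metric above, Character Error Rate (CER), is the character-level edit distance between the model output and the ground truth, normalized by the reference length. As a minimal illustrative sketch (not the paper's evaluation code, which is not shown here), it can be computed as:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Classic dynamic-programming edit distance over characters
    (insertions, deletions, substitutions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,              # deletion from ref
                curr[j - 1] + 1,          # insertion into ref
                prev[j - 1] + (r != h),   # substitution (0 if match)
            ))
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

# One substitution in a 6-character reference -> CER of 1/6
print(cer("kitten", "sitten"))
```

Note that for polytonic Greek, Unicode normalization (e.g. NFC vs. NFD handling of combining diacritics) can materially change the count, so reference and hypothesis strings should be normalized consistently before scoring.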