🤖 AI Summary
This study addresses the previously undefined task of automatically detecting Latin-script fragments in multilingual, typographically complex historical documents. We introduce the first multimodal annotated dataset (724 pages) designed specifically for this task, incorporating textual content, layout structure, and language annotations. Methodologically, we propose a joint multimodal modeling framework that integrates large foundation models with heterogeneous features, namely OCR-extracted text and document layout representations, to jointly perform cross-modal language identification and spatial localization. Comprehensive experiments evaluate state-of-the-art large language and vision-language models on this task, demonstrating that reliable detection is achievable. Key contributions include: (1) the formal definition and formulation of the Latin-script fragment detection task; (2) the release of the first dedicated benchmark dataset; and (3) an empirical analysis revealing both the promise and the limitations of foundation models in real-world historical document digitization, informing future work on intelligent cultural heritage processing.
📝 Abstract
This paper introduces the novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark the performance of large foundation models on a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection is achievable with contemporary models. Our study provides the first comprehensive analysis of these models' capabilities and limitations on this task.