🤖 AI Summary
Problem: Manual page-structure analysis of incunabula (early printed books) is labor-intensive, inefficient, and error-prone. Method: This paper introduces the first end-to-end multimodal framework for historical book page analysis. We construct a dedicated dataset of 500 manually annotated pages, augmented with DocLayNet, and employ YOLO11n for fine-grained detection of text, headings, figures, tables, and handwritten regions (F1 = 0.94). OCR is performed with Tesseract, which outperformed Kraken, and illustrative content is semantically described via a hybrid approach combining ResNet18 (98.7% image-classification accuracy) and CLIP. Contribution/Results: This work unifies object detection, OCR, and cross-modal semantic understanding for incunabula analysis (the first such integration), demonstrating both the efficacy and scalability of deep learning in digital humanities research on early printed materials.
📝 Abstract
We developed a proof-of-concept method for the automatic analysis of the structure and content of incunabula pages. A custom dataset comprising 500 annotated pages from five different incunabula was created using resources from the Jagiellonian Digital Library. Each page was manually labeled with five predefined classes: Text, Title, Picture, Table, and Handwriting. Additionally, the publicly available DocLayNet dataset was utilized as supplementary training data. To perform object detection, YOLO11n and YOLO11s models were employed and trained using two strategies: a combined dataset (DocLayNet and the custom dataset) and the custom dataset alone. The highest performance (F1 = 0.94) was achieved by the YOLO11n model trained exclusively on the custom data. Optical character recognition was then conducted on regions classified as Text, using both Tesseract and Kraken OCR, with Tesseract demonstrating superior results. Subsequently, image classification was applied to the Picture class using a ResNet18 model, achieving an accuracy of 98.7% across five subclasses: Decorative_letter, Illustration, Other, Stamp, and Wrong_detection. Furthermore, the CLIP model was utilized to generate semantic descriptions of illustrations. The results confirm the potential of machine learning in the analysis of early printed books, while emphasizing the need for further advancements in OCR performance and visual content interpretation.
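The abstract describes a staged pipeline: layout detection assigns each region one of five classes, after which Text regions go to OCR while Picture regions are sub-classified and, if they are illustrations, captioned. A minimal sketch of that routing logic is below; the `ocr`, `classify_picture`, and `describe` callables are hypothetical stand-ins for the Tesseract, ResNet18, and CLIP stages, and the `Region` structure is an assumption, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# The five layout classes used in the paper's custom dataset.
LAYOUT_CLASSES = {"Text", "Title", "Picture", "Table", "Handwriting"}

@dataclass
class Region:
    """A detected layout region on a page: class label plus bounding box."""
    label: str
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates

def route_regions(
    regions: List[Region],
    ocr: Callable[[Region], str],
    classify_picture: Callable[[Region], str],
    describe: Callable[[Region], str],
) -> List[Tuple[Region, str, Optional[str]]]:
    """Dispatch each detected region to the appropriate downstream stage:
    Text -> OCR; Picture -> subclass classifier, then captioning for
    illustrations; other classes are recorded without further processing."""
    results = []
    for r in regions:
        if r.label not in LAYOUT_CLASSES:
            continue  # discard detections outside the five known classes
        if r.label == "Text":
            results.append((r, "ocr", ocr(r)))
        elif r.label == "Picture":
            subclass = classify_picture(r)
            caption = describe(r) if subclass == "Illustration" else None
            results.append((r, subclass, caption))
        else:
            results.append((r, r.label, None))
    return results
```

In a full implementation the callables would wrap the actual models (e.g. a pytesseract call for `ocr`); the point of the sketch is only the class-conditional routing that the paper's pipeline implies.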