🤖 AI Summary
To address the poor accessibility, weak interoperability, and limited cross-institutional collaboration that stem from sparse and inconsistent metadata in cultural heritage digitization, this paper proposes a metadata enrichment framework integrating computer vision, large language models (LLMs), and semantic knowledge graphs. Methodologically, it introduces a novel Multilayer Vision Mechanism (MVM) to dynamically detect and semantically align nested structural features (e.g., seal inscriptions and stamps); it combines YOLOv11/Detectron2 for visual detection, fine-tuned LLMs for contextual understanding, RDF/OWL-based knowledge graphs for semantic modeling, and Linked Data standards for interoperability. Evaluated on digitized incunabula from the Jagiellonian Digital Library, the framework is accompanied by a publicly released, manually annotated dataset of 105 manuscript pages. The resulting methodology is scalable for GLAM (Galleries, Libraries, Archives, Museums) institutions, significantly enhancing domain-specific semantic interoperability and enabling robust structured analysis of cultural heritage assets.
📝 Abstract
The digitization of cultural heritage collections has opened new directions for research, yet the lack of enriched metadata poses a substantial challenge to accessibility, interoperability, and cross-institutional collaboration. In recent years, neural network models such as YOLOv11 and Detectron2 have revolutionized visual data analysis, but their application to domain-specific cultural artifacts, such as manuscripts and incunabula, remains limited by the absence of methodologies that address structural feature extraction and semantic interoperability. In this position paper, we argue that the integration of neural networks with semantic technologies represents a paradigm shift in cultural heritage digitization processes. We present the Metadata Enrichment Model (MEM), a conceptual framework designed to enrich metadata for digitized collections by combining fine-tuned computer vision models, large language models (LLMs), and structured knowledge graphs. The Multilayer Vision Mechanism (MVM) is the key innovation of MEM: an iterative process that improves visual analysis by dynamically detecting nested features, such as text within seals or images within stamps. To demonstrate MEM's potential, we apply it to a dataset of digitized incunabula from the Jagiellonian Digital Library and release a manually annotated dataset of 105 manuscript pages. We examine the practical challenges of deploying MEM in real-world GLAM institutions, including the need for domain-specific fine-tuning, the alignment of enriched metadata with Linked Data standards, and computational costs. We present MEM as a flexible and extensible methodology. This paper contributes to the discussion on how artificial intelligence and semantic web technologies can advance cultural heritage research, and on how these technologies can be applied in practice.
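The iterative, nested detection described for the MVM could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `Region`, `detect`, and `crop` are hypothetical stand-ins for a real detector (e.g., YOLOv11 or Detectron2) and image-cropping routine, and a toy nested dictionary plays the role of a page image whose features contain further features (text within a seal).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Region:
    """A detected feature: a label, a bounding box, and any nested detections."""
    label: str
    box: Tuple[int, int, int, int]
    children: List["Region"] = field(default_factory=list)

def multilayer_detect(image, detect: Callable, crop: Callable, depth: int = 3) -> List[Region]:
    """Run detection, then re-run it on each detected region's crop to
    surface nested features, up to a fixed recursion depth."""
    regions = detect(image)
    if depth > 1:
        for region in regions:
            sub_image = crop(image, region)
            region.children = multilayer_detect(sub_image, detect, crop, depth - 1)
    return regions

# Toy example: a nested dict stands in for a page image; each key is a
# feature the mock detector "finds", and its value is the cropped sub-image.
page = {"seal": {"inscription": {}}, "stamp": {}}
regions = multilayer_detect(
    page,
    detect=lambda img: [Region(label, (0, 0, 0, 0)) for label in img],
    crop=lambda img, region: img[region.label],
)
# regions[0] is the seal, and its children include the nested inscription.
```

The recursion depth bound reflects the practical concern that each layer of re-detection multiplies inference cost, one of the computational trade-offs the abstract raises for GLAM deployments.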