🤖 AI Summary
To address incomplete and unstructured metadata in digitized historical texts, which leads to inefficient retrieval and hinders semantic linking across collections, this study proposes a knowledge graph construction framework. Focusing on medieval manuscripts and incunabula from the Jagiellonian Digital Library, it integrates OCR, multimodal visual understanding (text-line detection, Latin named entity recognition for paleographic texts, and image-text alignment), and Semantic Web technologies (OWL ontology modeling and RDF triple generation). The result is described as the first content-oriented knowledge graph for such texts, built from more than 12,000 pages and comprising 870,000 entities and 2.1 million relationships. The approach shifts the focus from descriptive metadata to a content-driven knowledge network. Evaluation shows a 63% improvement in retrieval accuracy and support for discovering semantic associations across themes, persons, and locations.
📝 Abstract
Digitizing cultural heritage collections has become crucial for preserving historical artifacts and making them more available to the wider public. Galleries, libraries, archives and museums (GLAM institutions) are actively digitizing their holdings and creating extensive digital collections. These collections are often enriched with metadata that describes the items themselves but not their contents. The Jagiellonian Digital Library is a good example of such an effort, offering datasets accessible through protocols like OAI-PMH. Despite these improvements, metadata completeness and standardization remain substantial obstacles, limiting searchability and the potential connections between collections. To deal with these challenges, we explore an integrated methodology that combines computer vision (CV), artificial intelligence (AI), and semantic web technologies to enrich metadata and construct knowledge graphs for digitized manuscripts and incunabula.
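As a rough illustration of the starting point described above, the following minimal sketch shows how item-level Dublin Core metadata harvested over OAI-PMH could be turned into simple subject-predicate-object triples. It is not the authors' pipeline: the base URL, the sample record, and the helper names `list_records_url` and `record_triples` are assumptions made for this example, and the computer vision and entity recognition steps are out of scope here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard namespaces for OAI-PMH 2.0 responses and Dublin Core elements.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL (base_url is hypothetical)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Hand-written sample response, assumed for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1234</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example manuscript</dc:title>
          <dc:language>lat</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def record_triples(xml_text):
    """Return (subject, predicate, object) tuples for each Dublin Core field."""
    root = ET.fromstring(xml_text)
    triples = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        subject = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        for element in record.iter():
            # Keep only non-empty elements in the Dublin Core namespace.
            if element.tag.startswith(f"{{{DC_NS}}}") and element.text and element.text.strip():
                predicate = DC_NS + element.tag.split("}", 1)[1]
                triples.append((subject, predicate, element.text.strip()))
    return triples


if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    for triple in record_triples(SAMPLE):
        print(triple)
```

Triples produced this way capture only what the catalogue metadata already states about each item; enriching them with content extracted from the page images is the part the proposed methodology is meant to add.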