SiDiaC: Sinhala Diachronic Corpus

📅 2025-09-22

🤖 AI Summary

To address the scarcity of diachronic corpora for Sinhala NLP research, this paper introduces SiDiaC, the first comprehensive Sinhala diachronic corpus, spanning the 5th to the 20th century CE. SiDiaC comprises 46 literary works (58k words), each annotated with its date of composition and classified in a two-layer genre scheme: a primary Fiction/Non-Fiction split and a more specific secondary genre (Religious, History, Poetry, Language, or Medical). The corpus was built by digitizing texts from the National Library of Sri Lanka with Google Document AI OCR, followed by post-processing to correct formatting and modernize the orthography. Syntactic annotation and text normalization strategies were informed by FarPaHC. Despite limited access to rare texts and reliance on secondary sources for dating, SiDiaC provides a multi-genre, temporally stratified foundation for lexical change analysis, neologism tracking, historical syntax studies, and corpus-based lexicography.

📝 Abstract

SiDiaC, the first comprehensive Sinhala Diachronic Corpus, covers a historical span from the 5th to the 20th century CE. SiDiaC comprises 58k words across 46 literary works, annotated carefully based on the written date, after filtering based on availability, authorship, copyright compliance, and data attribution. Texts from the National Library of Sri Lanka were digitised using Google Document AI OCR, followed by post-processing to correct formatting and modernise the orthography. The construction of SiDiaC was informed by practices from other corpora, such as FarPaHC, particularly in syntactic annotation and text normalisation strategies, due to the shared characteristics of low-resourced language status. This corpus is categorised based on genres into two layers: primary and secondary. Primary categorisation is binary, classifying each book into Non-Fiction or Fiction, while the secondary categorisation is more specific, grouping texts under Religious, History, Poetry, Language, and Medical genres. Despite challenges including limited access to rare texts and reliance on secondary date sources, SiDiaC serves as a foundational resource for Sinhala NLP, significantly extending the resources available for Sinhala, enabling diachronic studies in lexical change, neologism tracking, historical syntax, and corpus-based lexicography.
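The two-layer categorisation and date annotation described above can be pictured as a small metadata record per text. The following is a minimal sketch, not the authors' actual schema: the `Work` class, its field names, the single-year dating, the one-secondary-genre-per-text restriction, and the placeholder titles are all assumptions made for illustration. Only the genre labels come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass

# Genre labels as listed in the abstract.
PRIMARY_GENRES = {"Fiction", "Non-Fiction"}
SECONDARY_GENRES = {"Religious", "History", "Poetry", "Language", "Medical"}


@dataclass(frozen=True)
class Work:
    """Hypothetical metadata record for one text in a diachronic corpus."""
    title: str
    year_written: int  # CE; assumes a single year, real dating may be a range
    primary: str       # first layer: Fiction or Non-Fiction
    secondary: str     # second layer: assumes exactly one genre per text

    def __post_init__(self):
        # Reject labels outside the two-layer scheme.
        if self.primary not in PRIMARY_GENRES:
            raise ValueError(f"unknown primary genre: {self.primary}")
        if self.secondary not in SECONDARY_GENRES:
            raise ValueError(f"unknown secondary genre: {self.secondary}")


def century(year: int) -> int:
    """Map a CE year to its century (years 401 to 500 are the 5th century)."""
    return (year - 1) // 100 + 1


def works_per_century(works):
    """Count texts per century, the basic stratification for diachronic study."""
    return Counter(century(w.year_written) for w in works)


# Placeholder entries, not real corpus data.
sample = [
    Work("Placeholder text A", 450, "Non-Fiction", "Religious"),
    Work("Placeholder text B", 1250, "Fiction", "Poetry"),
    Work("Placeholder text C", 1290, "Non-Fiction", "History"),
]
counts = works_per_century(sample)
```

Bucketing by century is one simple way to compare vocabulary or syntax across periods. A corpus dated from secondary sources might instead need date ranges or a confidence field.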

Problem

Research questions and friction points this paper is trying to address.

Creates the first diachronic corpus for the Sinhala language, spanning 16 centuries (5th to 20th century CE)

Addresses resource scarcity by digitizing and annotating 58k words from 46 literary works

Enables historical linguistic studies like lexical change and syntax evolution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Digitized historical texts using Google Document AI OCR

Applied post-processing for formatting and orthography modernization

Adapted syntactic annotation and text normalization practices from FarPaHC
