🤖 AI Summary
This work addresses the scarcity of high-quality benchmark corpora that has hindered natural language processing (NLP) research on Sinhala legal texts. We present the first large-scale, systematically constructed Sinhala legal corpus, comprising approximately 2 million words from 1,206 legislative documents. Texts were extracted using Google Document AI, followed by rigorous manual proofreading and post-processing to ensure data quality. The corpus is enriched with structured metadata and supports diverse NLP tasks, including named entity recognition, topic modeling, and language model evaluation. Its strong domain specificity and structural regularity make it particularly well-suited for legal summarization and information extraction, thereby filling a critical data gap in AI research on Sinhala legal text.
📝 Abstract
SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents, systematically collected from official sources: 1,065 Acts dated from 1981 to 2014 and 141 Bills from 2010 to 2014. The texts were extracted using OCR with Google Document AI, followed by extensive post-processing and manual cleaning to ensure high-quality, machine-readable content. Each document is accompanied by a dedicated metadata file. A comprehensive evaluation was conducted, covering corpus statistics, lexical diversity, word frequency analysis, named entity recognition, and topic modelling; the results demonstrate the structured and domain-specific nature of the corpus. Additionally, perplexity analysis using both large and small language models was performed to assess how well language models handle domain-specific text. The SinhaLegal corpus represents a vital resource designed to support NLP tasks such as summarisation, information extraction, and analysis, thereby bridging a critical gap in Sinhala legal research.
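To make the evaluation measures named above concrete, the following is a minimal sketch of how lexical diversity, word frequency, and perplexity are commonly computed. It is not the authors' pipeline. The function names, the whitespace tokenisation, and the English placeholder tokens are all assumptions made for illustration, and the abstract does not state which lexical diversity measure was used (type-token ratio is assumed here). Perplexity is computed from per-token natural-log probabilities, which in practice would come from whichever language model is being evaluated.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Lexical diversity as distinct tokens / total tokens (0.0 if empty)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def top_words(tokens, k=2):
    """The k most frequent tokens with their counts."""
    return Counter(tokens).most_common(k)


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Placeholder tokens (assumption): a real run would tokenise Sinhala text.
tokens = "the act shall apply to the act of the board".split()
ttr = type_token_ratio(tokens)
frequent = top_words(tokens)

# Hypothetical log-probabilities: a model assigning p = 0.25 to every token.
ppl = perplexity([math.log(0.25)] * 4)
```

A lower type-token ratio and a small set of dominant words are what one would expect from formulaic legislative text, and a model that assigns each token a probability of 0.25 has a perplexity of 4, so lower perplexity indicates that the model finds the text more predictable.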