SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts

📅 2026-03-05
🤖 AI Summary
This work addresses the scarcity of high-quality benchmark corpora that has hindered natural language processing (NLP) research in Sinhala legal texts. We present the first large-scale, systematically constructed Sinhala legal corpus, comprising approximately 2 million words from 1,206 legislative documents. Texts were extracted using Google Document AI followed by rigorous manual proofreading and post-processing to ensure data quality. The corpus is enriched with structured metadata and supports diverse NLP tasks, including named entity recognition, topic modeling, and language model evaluation. Its strong domain specificity and structural regularity make it particularly well-suited for legal summarization and information extraction, thereby filling a critical data gap in Sinhala legal artificial intelligence research.

📝 Abstract
SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents: 1,065 Acts dated from 1981 to 2014 and 141 Bills from 2010 to 2014, which were systematically collected from official sources. The texts were extracted using OCR with Google Document AI, followed by extensive post-processing and manual cleaning to ensure high-quality, machine-readable content, along with dedicated metadata files for each document. A comprehensive evaluation was conducted, including corpus statistics, lexical diversity, word frequency analysis, named entity recognition, and topic modelling, demonstrating the structured and domain-specific nature of the corpus. Additionally, perplexity analysis using both large and small language models was performed to assess how well language models handle domain-specific text. The SinhaLegal corpus is a resource designed to support NLP tasks such as summarisation, information extraction, and analysis, thereby bridging a critical gap in Sinhala legal research.
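The corpus-level measures named in the abstract (lexical diversity, word frequency, perplexity) can be illustrated with a minimal sketch. This is not the authors' pipeline: whitespace tokenization, the type-token ratio as the diversity measure, and all function names below are assumptions made for illustration. The perplexity helper assumes per-token natural-log probabilities have already been obtained from some language model.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization; the paper's tokenizer may differ.
    return text.split()


def type_token_ratio(tokens):
    # Lexical diversity as distinct word types divided by total tokens.
    # Returns 0.0 for an empty token list.
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def top_words(tokens, k=10):
    # The k most frequent tokens with their counts, most frequent first.
    return Counter(tokens).most_common(k)


def perplexity(token_log_probs):
    # Perplexity = exp of the negative mean natural-log probability per token.
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Under these assumptions, a lower type-token ratio and a skewed frequency list would be consistent with the formulaic, repetitive structure typical of legislative text, and lower perplexity would suggest a model fits the domain better. Note that the type-token ratio is sensitive to document length, so comparisons across documents of very different sizes need care.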
Problem

Research questions and friction points this paper is trying to address.

Sinhala legislative texts
information extraction
benchmark corpus
legal NLP
low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sinhala legal corpus
OCR post-processing
domain-specific NLP
named entity recognition
perplexity analysis
Minduli Lasandi
School of Computing, Informatics Institute of Technology, Sri Lanka
Nevidu Jayatilleke
University of Moratuwa, Sri Lanka
Computational Linguistics, Artificial Intelligence, Machine Learning