🤖 AI Summary
Maithili, a low-resource language, suffers from a scarcity of high-quality textual corpora and dedicated pre-trained language models. To address this gap, we construct a large-scale Maithili corpus and propose maiBERT, the first pre-trained language model built specifically for Maithili, based on the BERT architecture with masked language modeling (MLM) as the pre-training objective. On a news classification benchmark, fine-tuned maiBERT achieves 87.02% accuracy, outperforming NepBERTa and HindiBERT with a 0.13% gain in overall accuracy and 5–7 percentage-point improvements on individual classes. The model is publicly released and supports downstream tasks including sentiment analysis and named entity recognition, establishing a scalable foundation for AI applications in Maithili.
📝 Abstract
Natural Language Understanding (NLU) for low-resource languages remains a major challenge in NLP due to the scarcity of high-quality data and language-specific models. Maithili, despite being spoken by millions, lacks adequate computational resources, limiting its inclusion in digital and AI-driven applications. To address this gap, we introduce maiBERT, a BERT-based language model pre-trained specifically for Maithili using the Masked Language Modeling (MLM) technique. Our model is trained on a newly constructed Maithili corpus and evaluated through a news classification task. In our experiments, maiBERT achieved an accuracy of 87.02%, outperforming existing regional models such as NepBERTa and HindiBERT, with a 0.13% gain in overall accuracy and 5–7% improvements across various classes. We have open-sourced maiBERT on Hugging Face, enabling further fine-tuning for downstream tasks such as sentiment analysis and Named Entity Recognition (NER).
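To make the MLM pre-training objective concrete, below is a minimal sketch of BERT-style input corruption as it is commonly implemented. The 80/10/10 replacement split, the 15% masking rate, and the token ids (`MASK_ID`, `VOCAB_SIZE`) are illustrative assumptions, not the paper's exact configuration.

```python
import random

# Illustrative values, not maiBERT's actual vocabulary:
MASK_ID = 103        # id of the [MASK] token in this sketch
VOCAB_SIZE = 30000   # assumed vocabulary size
IGNORE_INDEX = -100  # positions excluded from the MLM loss

def mlm_corrupt(token_ids, mask_prob=0.15, rng=None):
    """Apply BERT-style MLM corruption to a token-id sequence.

    Each token is selected for prediction with probability `mask_prob`.
    Of the selected tokens, 80% are replaced by [MASK], 10% by a random
    token, and 10% are left unchanged. Returns (inputs, labels), where
    labels hold the original id at selected positions and IGNORE_INDEX
    elsewhere, so the loss is computed only over selected tokens.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue            # token not selected for prediction
        labels[i] = tok         # the model must recover this id
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID               # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # else: 10% keep the original token unchanged
    return inputs, labels

ids = list(range(1000, 1020))
inputs, labels = mlm_corrupt(ids, rng=random.Random(0))
```

During pre-training, the corrupted `inputs` are fed to the model and the cross-entropy loss is computed only at positions where `labels` differs from `IGNORE_INDEX`; this is the objective maiBERT is trained with, regardless of the exact masking hyperparameters used.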