Can maiBERT Speak for Maithili?

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Maithili, a low-resource language, suffers from a scarcity of high-quality textual corpora and dedicated pre-trained language models. To address this gap, we construct the first large-scale Maithili corpus and propose maiBERT, the first language model pre-trained specifically for Maithili, built on the BERT architecture with masked language modeling (MLM) as the pre-training objective. On a news classification benchmark, fine-tuned maiBERT achieves 87.02% accuracy, outperforming NepBERTa and HindiBERT with a 0.13% overall accuracy gain and 5–7% improvements on individual classes. The model is publicly released and supports downstream tasks including sentiment analysis and named entity recognition, establishing a scalable foundation for AI applications in Maithili.

📝 Abstract
Natural Language Understanding (NLU) for low-resource languages remains a major challenge in NLP due to the scarcity of high-quality data and language-specific models. Maithili, despite being spoken by millions, lacks adequate computational resources, limiting its inclusion in digital and AI-driven applications. To address this gap, we introduce maiBERT, a BERT-based language model pre-trained specifically for Maithili using the Masked Language Modeling (MLM) technique. Our model is trained on a newly constructed Maithili corpus and evaluated through a news classification task. In our experiments, maiBERT achieved an accuracy of 87.02%, outperforming existing regional models such as NepBERTa and HindiBERT, with a 0.13% overall accuracy gain and a 5–7% improvement across various classes. We have open-sourced maiBERT on Hugging Face, enabling further fine-tuning for downstream tasks such as sentiment analysis and Named Entity Recognition (NER).
Problem

Research questions and friction points this paper is trying to address.

Addressing NLU challenges for low-resource Maithili language
Developing specialized BERT model for Maithili computational processing
Overcoming data scarcity to enable Maithili digital applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

BERT-based model for Maithili language
Masked Language Modeling pre-training technique
Open-sourced on Hugging Face platform
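The MLM pre-training technique listed above follows BERT's standard masking recipe: 15% of tokens are selected for prediction, of which 80% become the [MASK] token, 10% become a random token, and 10% are left unchanged. A minimal plain-Python sketch of this selection step (the mask token id and vocabulary size below are illustrative assumptions, not details from the paper):

```python
import random

MASK_ID = 103       # [MASK] id in standard BERT vocabularies (assumption)
VOCAB_SIZE = 30000  # hypothetical size of a Maithili tokenizer vocabulary

def mlm_mask(token_ids, mask_prob=0.15, rng=None):
    """Apply BERT-style MLM corruption to a token-id sequence.

    Returns (inputs, labels): labels are -100 (ignored by the loss)
    everywhere except the positions selected for prediction, where
    they hold the original token id.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:   # select ~15% of positions
            labels[i] = tok            # model must recover the original token
            r = rng.random()
            if r < 0.8:                # 80% of selected: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:              # 10%: replace with a random token
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```

In practice this corruption is applied on the fly by the training framework (e.g. a data collator), so each epoch sees a different masking of the same corpus.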