Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language

📅 2026-02-27
🤖 AI Summary
This study addresses the lack of systematic evaluation and effective models for sentence-level topic classification in Nepali by presenting the first comprehensive benchmark of ten pretrained BERT variants—including multilingual, Indic-language, and Nepali-specific models—on a balanced dataset of 25,006 annotated sentences. Evaluated using accuracy, weighted F1 score, and AUROC, the experiments demonstrate that MuRIL-large (F1: 90.60%) and NepBERTa (F1: 88.26%) significantly outperform all other models, establishing new state-of-the-art baselines for this task. This work fills a critical gap in Nepali natural language processing research and provides a reliable foundation for future applications in low-resource language understanding.

📝 Abstract
Transformer-based models such as BERT have significantly advanced Natural Language Processing (NLP) across many languages. However, Nepali, a low-resource language written in Devanagari script, remains relatively underexplored. This study benchmarks multilingual, Indic, Hindi, and Nepali BERT variants to evaluate their effectiveness in Nepali topic classification. Ten pre-trained models, including mBERT, XLM-R, MuRIL, DevBERT, HindiBERT, IndicBERT, and NepBERTa, were fine-tuned and tested on a balanced Nepali dataset containing 25,006 sentences across five conceptual domains. Performance was evaluated using accuracy, weighted precision, recall, F1-score, and AUROC. The results reveal that Indic models, particularly MuRIL-large, achieved the highest F1-score of 90.60%, outperforming both multilingual and monolingual models. NepBERTa also performed competitively with an F1-score of 88.26%. Overall, these findings establish a robust baseline for future document-level classification and broader Nepali NLP applications.
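The abstract reports accuracy, weighted F1, and AUROC for a five-class task. A minimal sketch of how these metrics can be computed with scikit-learn is shown below; the labels and class probabilities here are illustrative placeholders, not the paper's data or code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Illustrative 5-class setup mirroring the paper's five conceptual domains.
# These labels, predictions, and probabilities are made up for demonstration.
y_true = np.array([0, 1, 2, 3, 4, 0, 1, 2])
y_pred = np.array([0, 1, 2, 3, 4, 0, 2, 2])

# Toy per-class probabilities: high mass on the predicted class,
# with the remainder spread evenly so each row sums to 1.
n_classes = 5
y_prob = np.full((len(y_pred), n_classes), 0.025)
y_prob[np.arange(len(y_pred)), y_pred] = 0.9

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted")   # weighted F1, as in the paper
auroc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="weighted")
print(f"accuracy={acc:.4f}  weighted-F1={f1:.4f}  AUROC={auroc:.4f}")
```

Multiclass AUROC requires per-class probability scores (e.g., softmax outputs from a fine-tuned BERT head) rather than hard labels, which is why `y_prob` is passed instead of `y_pred`.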
Problem

Research questions and friction points this paper is trying to address.

Keywords: Nepali, topic classification, low-resource language, BERT, sentence-level
Innovation

Methods, ideas, or system contributions that make the work stand out.

Keywords: Nepali NLP, BERT benchmarking, topic classification, low-resource language, Indic language models
Nischal Karki
Information and Language Processing Research Lab, Kathmandu University, Kavre, Nepal
Bipesh Subedi
Information and Language Processing Research Lab, Kathmandu University, Kavre, Nepal
Prakash Poudyal
Information and Language Processing Research Lab, Kathmandu University, Kavre, Nepal
Rupak Raj Ghimire
Information and Language Processing Research Lab, Kathmandu University, Kavre, Nepal
Bal Krishna Bal
Professor of Computer Engineering, Kathmandu University
Natural Language Processing · Sentiment Analysis · Software Localization