Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
For low-resource Pakistani languages—including Urdu, Shahmukhi, Sindhi, and Pashto—named entity recognition (NER) suffers from severe annotation scarcity and inadequate contextual representations in pretrained models. To address this, we propose a culturally adapted cross-lingual data augmentation framework. Methodologically, we systematically validate the effectiveness of fine-tuning multilingual masked language models (e.g., XLM-R) on Shahmukhi and Pashto for the first time, integrating prompt-driven generative data augmentation with few-shot learning. Our contributions are twofold: (1) designing cross-lingual augmentation strategies explicitly aligned with South Asian orthographic conventions and cultural context; and (2) uncovering synergistic gains of large generative models in ultra-low-resource NER settings. Experiments demonstrate that our approach significantly outperforms zero-shot cross-lingual transfer and conventional data augmentation baselines on Shahmukhi and Pashto NER, achieving absolute F1-score improvements of 12.3–18.7 percentage points.

📝 Abstract
Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has advanced significantly for high-resource languages. For low-resource languages, however, it remains understudied and challenging because annotated datasets are scarce and these languages are poorly represented in Pre-trained Language Models (PLMs). To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences, and we evaluate it on four low-resource Pakistani languages: Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.
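The abstract does not spell out how the augmented sentences are produced. One common way to realize this kind of augmentation is mention replacement: each entity span in a BIO-tagged sentence is swapped for another mention of the same type, drawn from a pool that could be collected from a related language. The sketch below illustrates that idea only and is not the authors' method. The function names, the `PER`/`LOC` tag names, and the romanized toy data are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, BIO tag), e.g. ("Lahore", "B-LOC")


def extract_spans(sentence: List[Token]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, entity_type) for each BIO entity span."""
    spans = []
    start, etype = None, None
    for i, (_, tag) in enumerate(sentence):
        # A B- tag, or an I- tag whose type differs from the open span, starts a new span.
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if start is not None:
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag == "O":
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:
        spans.append((start, len(sentence), etype))
    return spans


def augment(sentence: List[Token],
            pool: Dict[str, List[List[str]]],
            rng: random.Random) -> List[Token]:
    """Replace every entity span with a same-type mention sampled from `pool`."""
    out: List[Token] = []
    cursor = 0
    for start, end, etype in extract_spans(sentence):
        out.extend(sentence[cursor:start])  # copy the non-entity context unchanged
        candidates = pool.get(etype)
        if candidates:
            words = rng.choice(candidates)
            # Re-tag the replacement so the BIO sequence stays valid.
            out.extend((w, ("B-" if j == 0 else "I-") + etype)
                       for j, w in enumerate(words))
        else:
            out.extend(sentence[start:end])  # no candidates: keep the original mention
        cursor = end
    out.extend(sentence[cursor:])
    return out


# Hypothetical toy data (romanized for readability).
sentence = [("Ali", "B-PER"), ("lives", "O"), ("in", "O"), ("Lahore", "B-LOC")]
pool = {"PER": [["Ayesha", "Khan"]], "LOC": [["Peshawar"]]}
augmented = augment(sentence, pool, random.Random(0))
```

Restricting the pool to names and places that actually occur in the target region is one simple way to keep the generated sentences culturally plausible, which is the property the abstract emphasizes.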
Problem

Research questions and friction points this paper is trying to address.

Improving NER for low-resource Pakistani languages
Addressing lack of annotated datasets via cross-lingual augmentation
Enhancing PLMs' performance with culturally plausible data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual data augmentation for NER
Fine-tuning multilingual masked LLMs
Generative LLMs for few-shot learning
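
The few-shot use of generative LLMs listed above can be pictured as prompt construction: a handful of labeled sentences are shown to the model, followed by the sentence to be tagged. The sketch below builds such a prompt as plain text. The prompt wording, the output format, and the function names are assumptions for illustration, and the page does not describe the prompts actually used in the paper. No model call is made here.

```python
from typing import List, Tuple

Entity = Tuple[str, str]             # (mention, entity type)
Example = Tuple[str, List[Entity]]   # (sentence, gold entities)


def format_entities(entities: List[Entity]) -> str:
    """Render entities as 'mention -> TYPE' pairs, or 'none' when empty."""
    if not entities:
        return "none"
    return "; ".join(f"{mention} -> {etype}" for mention, etype in entities)


def build_prompt(examples: List[Example], query: str, labels: List[str]) -> str:
    """Assemble a few-shot NER prompt ending at the slot the model should fill."""
    lines = [
        "Extract the named entities from the sentence.",
        f"Allowed types: {', '.join(labels)}.",
        "",
    ]
    for sent, ents in examples:
        lines.append(f"Sentence: {sent}")
        lines.append(f"Entities: {format_entities(ents)}")
        lines.append("")
    lines.append(f"Sentence: {query}")
    lines.append("Entities:")
    return "\n".join(lines)


# Hypothetical demonstrations (romanized for readability).
shots = [
    ("Ali lives in Lahore", [("Ali", "PER"), ("Lahore", "LOC")]),
    ("It rained all day", []),
]
prompt = build_prompt(shots, "Ayesha visited Peshawar", ["PER", "LOC", "ORG"])
```

The same template can be turned around for augmentation by asking the model to write new sentences for a given list of entities, although the generated text would still need to be checked before it is used for training.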
Toqeer Ehsan
Teknologian tutkimuskeskus VTT Oy
Natural Language Processing, Deep Learning, Artificial Intelligence
T. Solorio
Department of Natural Language Processing, MBZUAI, Abu Dhabi, United Arab Emirates