Tokenization Matters: Improving Zero-Shot NER for Indic Languages

📅 2025-04-23
🤖 AI Summary
This study addresses performance bottlenecks in zero-shot cross-lingual named entity recognition (NER) for low-resource Indic languages arising from morphological complexity, lexical sparsity, and script diversity. The authors systematically evaluate three tokenization strategies (Byte-Pair Encoding (BPE), SentencePiece, and character-level tokenization) with IndicBERT on Assamese, Bengali, Marathi, and Odia, as well as the extremely low-resource Santali, Manipuri, and Arabic-script Sindhi. The analysis shows that SentencePiece consistently outperforms both BPE and character-level tokenization: its subword segmentation better captures inflectional morphology and cross-script regularities, substantially reducing out-of-vocabulary (OOV) rates and improving entity-boundary consistency. Empirically, SentencePiece yields average F1 gains of 4.2–9.7 percentage points in zero-shot NER across target languages, with particularly marked improvements for extremely low-resource and multigraphic languages. These findings establish a transferable tokenization-optimization paradigm for low-resource NER.

📝 Abstract
Tokenization is a critical component of Natural Language Processing (NLP), especially for low-resource languages, where subword segmentation influences vocabulary structure and downstream task accuracy. Although Byte-Pair Encoding (BPE) is a standard tokenization method in multilingual language models, its suitability for Named Entity Recognition (NER) in low-resource Indic languages remains underexplored due to its limitations in handling morphological complexity. In this work, we systematically compare BPE, SentencePiece, and character-level tokenization strategies using IndicBERT for NER in low-resource Indic languages such as Assamese, Bengali, Marathi, and Odia, as well as extremely low-resource Indic languages such as Santali, Manipuri, and Sindhi. We assess both intrinsic linguistic properties (tokenization efficiency, out-of-vocabulary (OOV) rates, and morphological preservation) and extrinsic downstream performance, including fine-tuning and zero-shot cross-lingual transfer. Our experiments show that SentencePiece consistently outperforms BPE for NER in low-resource Indic languages, particularly in zero-shot cross-lingual settings, because it better preserves entity consistency. While BPE produces the most compact tokenization, it generalizes poorly, misclassifying or even failing to recognize entity labels when tested on unseen languages. In contrast, SentencePiece better preserves linguistic structure, benefiting extremely low-resource and morphologically rich Indic languages such as Santali and Manipuri with superior entity recognition, and generalizing well across scripts, as for Sindhi written in the Arabic script. The results point to SentencePiece as the more effective tokenization strategy for NER in multilingual and low-resource Indic NLP applications.
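The intrinsic properties the abstract evaluates, tokenization efficiency (subwords per word, often called fertility) and OOV rate, can be computed with a few lines of code. The sketch below is a toy illustration: the segmentations and vocabulary are hypothetical, not IndicBERT's actual tokenizers or vocabularies.

```python
# Toy sketch of two intrinsic tokenization metrics from the abstract:
# fertility (subword tokens per word) and out-of-vocabulary (OOV) rate.
# Segmentations and vocabulary are hypothetical illustrations, not the
# actual IndicBERT / SentencePiece / BPE vocabularies from the paper.

def fertility(segmented_words):
    """Average number of subword tokens produced per word."""
    total_subwords = sum(len(pieces) for pieces in segmented_words)
    return total_subwords / len(segmented_words)

def oov_rate(segmented_words, vocab):
    """Fraction of subword tokens falling outside the vocabulary."""
    pieces = [p for word in segmented_words for p in word]
    oov = sum(1 for p in pieces if p not in vocab)
    return oov / len(pieces)

# Hypothetical segmentations of the same 3-word sentence under two schemes.
bpe_segments = [["tok", "en", "ization"], ["mat", "ters"], ["!"]]
sp_segments = [["▁token", "ization"], ["▁matters"], ["▁!"]]
vocab = {"tok", "en", "ization", "mat", "▁token", "▁matters", "▁!"}

print(fertility(bpe_segments))        # 2.0 (6 subwords over 3 words)
print(fertility(sp_segments))
print(oov_rate(bpe_segments, vocab))  # "ters" and "!" are OOV here
```

Lower fertility means more compact tokenization (the abstract's point about BPE), while a lower OOV rate on held-out languages is one signal of the cross-lingual robustness attributed to SentencePiece.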
Problem

Research questions and friction points this paper is trying to address.

Evaluating tokenization methods for Indic NER tasks
Comparing BPE, SentencePiece, and character-level tokenization
Improving zero-shot cross-lingual NER in low-resource languages
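The zero-shot comparisons above are scored with entity-level F1, which rewards a prediction only when the span boundaries and entity type both match exactly (the "entity boundary consistency" the summary mentions). A minimal sketch of that metric over BIO tag sequences follows; the tags and spans are hypothetical examples, not the paper's evaluation code or data.

```python
# Minimal entity-level precision/recall/F1 over BIO tag sequences, the
# standard NER metric behind zero-shot comparisons like those above.
# The example tags are hypothetical, not data from the paper.

def extract_spans(bio_tags):
    """Collect (start, end, type) entity spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(bio_tags + ["O"]):  # sentinel flushes last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]  # tolerate I- without a B- prefix
    return set(spans)

def span_f1(gold_tags, pred_tags):
    """F1 where a span counts only if boundaries AND type match exactly."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O"]  # wrong type on 2nd entity
print(span_f1(gold, pred))  # 0.5: one of two entities matched exactly
```

Because a fragmented subword segmentation tends to shift entity boundaries, tokenizers that preserve morphological units score better under this exact-match criterion.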
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compares BPE, SentencePiece, and character-level tokenization
Uses IndicBERT for NER in Indic languages
SentencePiece outperforms BPE in zero-shot NER
Priyaranjan Pattnayak
Oracle Cloud Gen AI & University of Washington - Seattle
NLP, Machine Learning, Deep Learning, Generative AI
H. Patel
New York University, New York
Amit Agarwal
Liverpool John Moores University, Liverpool