MorphTok: Morphologically Grounded Tokenization for Indian Languages

2025-04-14
AI Summary
This paper addresses a limitation of standard Byte-Pair Encoding (BPE): its greedy, frequency-driven merges ignore linguistic structure, and in syllable-based scripts such as those used for Hindi and Marathi, the Unicode encoding of dependent vowels adds further ambiguity. We propose morphology-aware segmentation, including sandhi splitting, as a pre-tokenization step before BPE, together with Constrained BPE (CBPE), an extension of BPE that applies script-specific constraints to dependent vowels. Our contributions are threefold: (1) a new Hindi–Marathi dataset for morphology-aware segmentation that incorporates sandhi splitting; (2) the CBPE algorithm; and (3) EvalTok, a human evaluation metric for comparing segmentations across tokenization algorithms. Experiments show that CBPE reduces fertility scores by 1.68% while maintaining comparable or improved machine translation performance, offering a computationally efficient alternative to standard BPE, and that morphologically grounded tokenization improves downstream performance in machine translation and language modeling.
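The summary reports a fertility reduction without defining the metric. Assuming the common definition (average number of subword tokens per word), a relative reduction could be computed as in the sketch below. The function names and the toy segmentations are illustrative assumptions, not the paper's code or data.

```python
def fertility(tokenized_words):
    """Average number of subword tokens per word.

    `tokenized_words` is a list with one entry per word; each entry is the
    list of subword tokens the tokenizer produced for that word.
    """
    if not tokenized_words:
        raise ValueError("need at least one word")
    total_tokens = sum(len(tokens) for tokens in tokenized_words)
    return total_tokens / len(tokenized_words)


def relative_reduction(baseline, candidate):
    """Percentage by which `candidate` fertility is lower than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


# Toy segmentations (made up for illustration only).
baseline_seg = [["un", "break", "able"], ["cat", "s"]]
candidate_seg = [["un", "breakable"], ["cats"]]

baseline_fertility = fertility(baseline_seg)
candidate_fertility = fertility(candidate_seg)
reduction_pct = relative_reduction(baseline_fertility, candidate_fertility)
```

Lower fertility means fewer tokens per word, and therefore shorter input sequences for the same text.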

Abstract
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm for subword tokenization that greedily merges frequent character bigrams. This often leads to segmentation that does not align with linguistically meaningful units. To address this, we propose morphology-aware segmentation as a pre-tokenization step prior to applying BPE. To facilitate morphology-aware segmentation, we create a novel dataset for Hindi and Marathi, incorporating sandhi splitting to enhance the subword tokenization. Experiments on downstream tasks show that morphologically grounded tokenization improves performance for machine translation and language modeling. Additionally, to handle the ambiguity in the Unicode characters for diacritics, particularly dependent vowels in syllable-based writing systems, we introduce Constrained BPE (CBPE), an extension to the traditional BPE algorithm that incorporates script-specific constraints. Specifically, CBPE handles dependent vowels. Our results show that CBPE achieves a 1.68% reduction in fertility scores while maintaining comparable or improved downstream performance in machine translation, offering a computationally efficient alternative to standard BPE. Moreover, to evaluate segmentation across different tokenization algorithms, we introduce a new human evaluation metric, EvalTok, enabling more human-grounded assessment.
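The abstract does not spell out CBPE's constraints beyond saying that it handles dependent vowels. One plausible reading, sketched here as an assumption rather than the authors' algorithm, is to keep each dependent vowel sign attached to the character before it, so that it never enters the vocabulary as a standalone symbol. The function names are hypothetical, and using Unicode mark categories (`Mn`, `Mc`, `Me`) as a stand-in for "dependent vowel" is a simplification.

```python
import unicodedata
from collections import Counter


def is_dependent_mark(ch):
    # Assumption: treat every Unicode combining mark as "dependent".
    # This is broader than dependent vowel signs alone (it also covers
    # the virama, anusvara, nukta, and similar marks).
    return unicodedata.category(ch).startswith("M")


def initial_symbols(word, constrained):
    """Split a word into starting symbols for BPE.

    Unconstrained: one symbol per code point (classical BPE).
    Constrained: a mark is glued to the preceding symbol.
    """
    symbols = []
    for ch in word:
        if constrained and symbols and is_dependent_mark(ch):
            symbols[-1] += ch
        else:
            symbols.append(ch)
    return symbols


def learn_bpe(words, num_merges, constrained=False):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    corpus = Counter(tuple(initial_symbols(w, constrained)) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


if __name__ == "__main__":
    words = ["किताब", "किताबें", "कितना"]
    merges, corpus = learn_bpe(words, num_merges=5, constrained=True)
    print(merges)
```

With `constrained=False` the same function behaves as classical BPE, in which a vowel sign can start a token or stand alone. With `constrained=True` every token begins with a base character, because marks are attached before any merge is counted.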
Problem

Research questions and friction points this paper is trying to address.

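Standard BPE segmentation of Indian languages does not align with morphologically meaningful units
Unicode encoding of diacritics, particularly dependent vowels, creates ambiguity for BPE
Segmentation quality across tokenization algorithms lacks a human-grounded evaluation metric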
Innovation

Methods, ideas, or system contributions that make the work stand out.

Morphology-aware segmentation pre-tokenization step
Constrained BPE with script-specific constraints
Human evaluation metric EvalTok for assessment
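The pre-tokenization step listed above can be pictured as follows: split each word at its morphological (sandhi) boundaries first, then run BPE on each resulting unit separately so that no merge crosses a boundary. The sketch below is a minimal illustration under stated assumptions. `SANDHI_SPLITS` is a hypothetical hand-written lookup with two textbook examples, standing in for whatever splitter or annotated dataset a real system would use, and `pre_tokenize` is an illustrative name.

```python
# Hypothetical lookup table of sandhi splits (illustration only).
# A real pipeline would rely on a trained splitter or an annotated
# dataset rather than a hand-written dictionary.
SANDHI_SPLITS = {
    "विद्यालय": ["विद्या", "आलय"],
    "हिमालय": ["हिम", "आलय"],
}


def pre_tokenize(text, splits=SANDHI_SPLITS):
    """Split text on whitespace, then split known words at sandhi boundaries.

    Words missing from `splits` are passed through unchanged.
    The returned units are what a subword tokenizer such as BPE would
    then process one at a time.
    """
    units = []
    for word in text.split():
        units.extend(splits.get(word, [word]))
    return units


if __name__ == "__main__":
    print(pre_tokenize("हिमालय सुंदर है"))
```

Because BPE then sees each unit in isolation, a frequent character pair that straddles a boundary can no longer be merged into a single token.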
Authors

Maharaj Brahma
Indian Institute of Technology Hyderabad (IITH)
Natural Language Processing, Culture NLP, Low-resource MT, Multilingual NLP

N. Karthika
Department of CSE, IIT Bombay

Atul Singh
Applied Researcher

D. Adiga
TIH, IIT Bombay

Smruti Bhate
TIH, IIT Bombay

Ganesh Ramakrishnan
Professor, Department of Computer Science and Engineering, Indian Institute of Technology Bombay
Machine Learning, Relational Learning, Information Extraction, Question Answering, Text Analytics

Rohit Saluja
School of Computing and Electrical Engineering, IIT Mandi, BharatGen Consortium

M. Desarkar
Department of CSE, IIT Hyderabad, BharatGen Consortium