🤖 AI Summary
Standard Byte-Pair Encoding (BPE) ignores morpheme boundaries, leading to suboptimal segmentation in morphologically rich languages. Method: MorphBPE is a morphology-aware extension of BPE that integrates linguistic morpheme structure into statistical subword segmentation while preserving BPE's statistical efficiency. Contribution/Results: The paper also introduces two morphology-based evaluation metrics: Morphological Consistency F1-Score, which quantifies the consistency between morpheme sharing and token sharing, and Morphological Edit Distance, which measures the alignment between morphemes and tokens. Experiments on English, Russian, Hungarian, and Arabic with 300M- and 1B-parameter language models show that MorphBPE consistently reduces cross-entropy loss, accelerates training convergence, and improves morphological alignment scores. It is fully compatible with existing LLM pipelines and requires only minimal modifications for integration.
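The core idea of morphology-aware merging can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function `count_merges`, the per-word boundary sets, and the assumption that morpheme boundaries come from an external analyzer are all hypothetical.

```python
# Sketch: BPE pair counting that refuses merges crossing a morpheme boundary.
# Hypothetical interface, not the MorphBPE codebase.
from collections import Counter

def count_merges(words, morph_boundaries):
    """Count candidate BPE merges, skipping pairs that straddle a morpheme boundary.

    words: list of words, each a list of current subword symbols (strings).
    morph_boundaries: per-word sets of character offsets where a morpheme
        boundary lies (e.g. {2} for "un|happy"), assumed to come from an
        external morphological analyzer.
    """
    pairs = Counter()
    for syms, bset in zip(words, morph_boundaries):
        pos = 0  # character offset of the gap after the current symbol
        for a, b in zip(syms, syms[1:]):
            pos += len(a)
            if pos not in bset:  # merging here would not cross a morpheme boundary
                pairs[(a, b)] += 1
    return pairs
```

For "unhappy" with a boundary at offset 2 (un|happy), the pair ("n", "h") is never counted, so no BPE token can span the prefix boundary; all other adjacent pairs are counted as usual.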
📝 Abstract
Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme boundaries, leading to suboptimal segmentation, particularly in morphologically rich languages. We introduce MorphBPE, a morphology-aware extension of BPE that integrates linguistic structure into subword tokenization while preserving statistical efficiency. Additionally, we propose two morphology-based evaluation metrics: (i) Morphological Consistency F1-Score, which quantifies the consistency between morpheme sharing and token sharing, contributing to LLM training convergence, and (ii) Morphological Edit Distance, which measures the alignment between morphemes and tokens in terms of interpretability. Experiments on English, Russian, Hungarian, and Arabic with 300M- and 1B-parameter LLMs demonstrate that MorphBPE consistently reduces cross-entropy loss, accelerates convergence, and improves morphological alignment scores. Fully compatible with existing LLM pipelines, MorphBPE requires minimal modifications for integration. The MorphBPE codebase and tokenizer playground will be available at: https://github.com/llm-lab-org/MorphBPE and https://tokenizer.llm-lab.org
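One plausible reading of the two proposed metrics can be sketched as follows; the function names, the pairwise formulation of the F1 score, and the use of segment-level Levenshtein distance for the edit metric are assumptions for illustration, not the paper's definitions.

```python
# Hedged sketches of the two metric ideas (illustrative, not the paper's code).
from itertools import combinations

def morph_consistency_f1(morphs, tokens):
    """Pairwise reading of Morphological Consistency F1.

    morphs, tokens: dicts mapping each word to its set of morphemes / tokens.
    A word pair is a true positive when sharing a morpheme coincides with
    sharing a token (hypothetical formulation).
    """
    tp = fp = fn = 0
    for w1, w2 in combinations(morphs, 2):
        share_m = bool(morphs[w1] & morphs[w2])
        share_t = bool(tokens[w1] & tokens[w2])
        if share_t and share_m:
            tp += 1
        elif share_t and not share_m:
            fp += 1
        elif share_m and not share_t:
            fn += 1
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def morph_edit_distance(morphemes, tokens):
    """Levenshtein distance between the morpheme and token segment sequences
    (one plausible reading of Morphological Edit Distance)."""
    m, n = len(morphemes), len(tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if morphemes[i - 1] == tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # delete a morpheme
                           dp[i][j - 1] + 1,       # insert a token
                           dp[i - 1][j - 1] + cost)  # match or substitute
    return dp[m][n]
```

Under this reading, a tokenizer whose segments coincide exactly with the morphemes gets edit distance 0, and words sharing a prefix morpheme are rewarded only when they also share the corresponding token.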