BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

📅 2025-05-30
🤖 AI Summary
Byte-Pair Encoding (BPE) tokenizers suffer from two limitations in multilingual settings: (i) encoding penalties for non-Latin scripts caused by UTF-8 byte fragmentation, and (ii) reduced robustness stemming from heuristic, regular-expression-based pretokenization. Method: We propose SCRIPT, a structured pretokenization framework grounded in Unicode Script and General Category properties. SCRIPT bypasses UTF-8 byte conversion by assigning initial tokens from these Unicode properties, segments text at script boundaries with simple rules instead of regular expressions, and constrains BPE merges so that character integrity is preserved. Contribution/Results: Empirical evaluation shows that SCRIPT-BPE achieves token compression comparable to standard BPE while eliminating encoding-based penalties for non-Latin-script languages. It also improves tokenization robustness, especially under noisy or malformed input, and supports cross-lingual fairness through consistent, script-aware segmentation across diverse writing systems.

📝 Abstract
Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization, often reliant on complex regular expressions, can also introduce fragility and unexpected edge cases. We propose SCRIPT (Script Category Representation in PreTokenization), a novel encoding scheme that bypasses UTF-8 byte conversion by using initial tokens based on Unicode script and category properties. This approach enables a simple, rule-based pretokenization strategy that respects script boundaries, offering a robust alternative to pretokenization strategies based on regular expressions. We also introduce and validate a constrained BPE merging strategy that enforces character integrity, applicable to both SCRIPT-BPE and byte-based BPE. Our experiments demonstrate that SCRIPT-BPE achieves competitive compression while eliminating encoding-based penalties for non-Latin-script languages.
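The rule-based, script-boundary pretokenization the abstract describes can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: `script_of` covers only a handful of script ranges by hand (the real SCRIPT scheme uses the full Unicode Script and General Category properties), and `pretokenize` is a hypothetical helper name.

```python
import unicodedata

def script_of(ch: str) -> str:
    """Toy script classifier: a few hard-coded ranges plus General Category.
    The actual SCRIPT encoding uses full Unicode Script properties."""
    cp = ord(ch)
    cat = unicodedata.category(ch)
    if cat.startswith("N"):
        return "Number"
    if not cat.startswith("L"):
        return "Other"          # punctuation, spaces, symbols
    if cp <= 0x024F:
        return "Latin"
    if 0x0400 <= cp <= 0x04FF:
        return "Cyrillic"
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "Katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "Han"
    return "Other"

def pretokenize(text: str) -> list[str]:
    """Split text wherever the script changes: rule-based, no regex."""
    if not text:
        return []
    pieces, start = [], 0
    for i in range(1, len(text)):
        if script_of(text[i]) != script_of(text[i - 1]):
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces

print(pretokenize("GPTモデル2025"))  # ['GPT', 'モデル', '2025']
```

Because each piece contains characters of a single script class, later BPE merges can never straddle a script boundary, which is the robustness property the regex-free approach targets.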
Problem

Research questions and friction points this paper is trying to address.

BPE tokenizers penalize non-Latin scripts in multilingual settings
Regular-expression-based pretokenization is fragile, with unexpected edge cases
Byte-level encoding creates tokens containing partial UTF-8 sequences
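The encoding penalty behind the first friction point is easy to see: byte-level BPE operates on UTF-8 bytes, and non-Latin characters cost two to four bytes each, so equivalent words start from much longer base sequences. A quick illustration (sample words chosen here for demonstration, not from the paper):

```python
# Each character encodes to a different number of UTF-8 bytes, so
# byte-level BPE starts non-Latin text from a longer base sequence.
samples = {
    "English": "hello",    # 5 chars -> 5 bytes  (1 byte/char)
    "Greek":   "γειά",     # 4 chars -> 8 bytes  (2 bytes/char)
    "Chinese": "你好",      # 2 chars -> 6 bytes  (3 bytes/char)
    "Hindi":   "नमस्ते",     # 6 code points -> 18 bytes (3 bytes each)
}
for lang, word in samples.items():
    print(f"{lang}: {len(word)} chars -> {len(word.encode('utf-8'))} bytes")
```

SCRIPT sidesteps this by assigning initial tokens from Unicode properties rather than from raw UTF-8 bytes, so the base sequence length no longer depends on the script.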
Innovation

Methods, ideas, or system contributions that make the work stand out.

SCRIPT encoding assigns initial tokens from Unicode script and category properties
Rule-based pretokenization respects script boundaries without regular expressions
Constrained BPE merging enforces character integrity
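One way the character-integrity constraint on byte-based BPE could be enforced is to admit a merge only if the merged byte string decodes to whole characters, or is still a strict prefix of a single multi-byte character under construction. This is a sketch of that idea under stated assumptions; `merge_allowed` and `utf8_char_len` are hypothetical helpers, not the paper's API.

```python
def utf8_char_len(lead: int) -> int:
    """Length announced by a UTF-8 lead byte (0 for continuation bytes)."""
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        return 0   # continuation byte: cannot start a character
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4

def merge_allowed(left: bytes, right: bytes) -> bool:
    """Permit a BPE merge only if the result keeps character integrity:
    it decodes to whole characters, or is a strict prefix of exactly one
    multi-byte character (a lead byte plus continuation bytes)."""
    merged = left + right
    try:
        merged.decode("utf-8")
        return True
    except UnicodeDecodeError:
        pass
    n = utf8_char_len(merged[0])
    return n > len(merged) and all(0x80 <= b < 0xC0 for b in merged[1:])

# "你" is e4 bd a0 in UTF-8: merges may build toward it, but gluing a
# complete character onto a dangling fragment is rejected.
print(merge_allowed(b"\xe4", b"\xbd"))      # True  (prefix of 你)
print(merge_allowed(b"\xe4\xbd", b"\xa0"))  # True  (completes 你)
print(merge_allowed(b"a", b"\xe4"))         # False (letter + fragment)
```

Under this rule a vocabulary can still contain all 256 single-byte base tokens, but no learned merge can ever produce a token that mixes complete characters with a partial UTF-8 sequence.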