BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

📅 2025-05-30
🤖 AI Summary
Byte-Pair Encoding (BPE) tokenizers suffer from two limitations in multilingual settings: (i) encoding penalties for non-Latin scripts caused by UTF-8 byte fragmentation, and (ii) reduced robustness stemming from heuristic, regular-expression-based pretokenization. Method: We propose SCRIPT, a structured pretokenization framework grounded in Unicode Script and General Category properties. SCRIPT bypasses UTF-8 byte conversion by assigning initial tokens from these Unicode properties, segments text at script boundaries with simple rules instead of regular expressions, and constrains BPE merges so that character integrity is preserved. Contribution/Results: Empirical evaluation shows that SCRIPT-BPE achieves token compression comparable to standard BPE while eliminating encoding-based penalties for non-Latin-script languages. It also improves tokenization robustness, especially under noisy or malformed input, and supports cross-lingual fairness through consistent, script-aware segmentation across diverse writing systems.

📝 Abstract
Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization, often reliant on complex regular expressions, can also introduce fragility and unexpected edge cases. We propose SCRIPT (Script Category Representation in PreTokenization), a novel encoding scheme that bypasses UTF-8 byte conversion by using initial tokens based on Unicode script and category properties. This approach enables a simple, rule-based pretokenization strategy that respects script boundaries, offering a robust alternative to pretokenization strategies based on regular expressions. We also introduce and validate a constrained BPE merging strategy that enforces character integrity, applicable to both SCRIPT-BPE and byte-based BPE. Our experiments demonstrate that SCRIPT-BPE achieves competitive compression while eliminating encoding-based penalties for non-Latin-script languages.
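The rule-based, script-boundary pretokenization the abstract describes can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: `script_of` covers only a handful of script ranges by hand (the real SCRIPT scheme uses the full Unicode Script and General Category properties), and `pretokenize` is a hypothetical helper name.

```python
import unicodedata

def script_of(ch: str) -> str:
    """Toy script classifier: a few hard-coded ranges plus General Category.
    The actual SCRIPT encoding uses full Unicode Script properties."""
    cp = ord(ch)
    cat = unicodedata.category(ch)
    if cat.startswith("N"):
        return "Number"
    if not cat.startswith("L"):
        return "Other"          # punctuation, spaces, symbols
    if cp <= 0x024F:
        return "Latin"
    if 0x0400 <= cp <= 0x04FF:
        return "Cyrillic"
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "Katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "Han"
    return "Other"

def pretokenize(text: str) -> list[str]:
    """Split text wherever the script changes: rule-based, no regex."""
    if not text:
        return []
    pieces, start = [], 0
    for i in range(1, len(text)):
        if script_of(text[i]) != script_of(text[i - 1]):
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces

print(pretokenize("GPTモデル2025"))  # ['GPT', 'モデル', '2025']
```

Because each piece contains characters of a single script class, later BPE merges can never straddle a script boundary, which is the robustness property the regex-free approach targets.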
Problem

Research questions and friction points this paper is trying to address.

BPE tokenizers penalize non-Latin scripts in multilingual settings
Regular-expression-based pretokenization is fragile, with unexpected edge cases
Byte-level encoding creates tokens containing partial UTF-8 sequences
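The encoding penalty behind the first friction point is easy to see: byte-level BPE operates on UTF-8 bytes, and non-Latin characters cost two to four bytes each, so equivalent words start from much longer base sequences. A quick illustration (sample words chosen here for demonstration, not from the paper):

```python
# Each character encodes to a different number of UTF-8 bytes, so
# byte-level BPE starts non-Latin text from a longer base sequence.
samples = {
    "English": "hello",    # 5 chars -> 5 bytes  (1 byte/char)
    "Greek":   "γειά",     # 4 chars -> 8 bytes  (2 bytes/char)
    "Chinese": "你好",      # 2 chars -> 6 bytes  (3 bytes/char)
    "Hindi":   "नमस्ते",     # 6 code points -> 18 bytes (3 bytes each)
}
for lang, word in samples.items():
    print(f"{lang}: {len(word)} chars -> {len(word.encode('utf-8'))} bytes")
```

SCRIPT sidesteps this by assigning initial tokens from Unicode properties rather than from raw UTF-8 bytes, so the base sequence length no longer depends on the script.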
Innovation

Methods, ideas, or system contributions that make the work stand out.

SCRIPT encoding assigns initial tokens from Unicode script and category properties
Rule-based pretokenization respects script boundaries without regular expressions
Constrained BPE merging enforces character integrity
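One way the character-integrity constraint on byte-based BPE could be enforced is to admit a merge only if the merged byte string decodes to whole characters, or is still a strict prefix of a single multi-byte character under construction. This is a sketch of that idea under stated assumptions; `merge_allowed` and `utf8_char_len` are hypothetical helpers, not the paper's API.

```python
def utf8_char_len(lead: int) -> int:
    """Length announced by a UTF-8 lead byte (0 for continuation bytes)."""
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        return 0   # continuation byte: cannot start a character
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4

def merge_allowed(left: bytes, right: bytes) -> bool:
    """Permit a BPE merge only if the result keeps character integrity:
    it decodes to whole characters, or is a strict prefix of exactly one
    multi-byte character (a lead byte plus continuation bytes)."""
    merged = left + right
    try:
        merged.decode("utf-8")
        return True
    except UnicodeDecodeError:
        pass
    n = utf8_char_len(merged[0])
    return n > len(merged) and all(0x80 <= b < 0xC0 for b in merged[1:])

# "你" is e4 bd a0 in UTF-8: merges may build toward it, but gluing a
# complete character onto a dangling fragment is rejected.
print(merge_allowed(b"\xe4", b"\xbd"))      # True  (prefix of 你)
print(merge_allowed(b"\xe4\xbd", b"\xa0"))  # True  (completes 你)
print(merge_allowed(b"a", b"\xe4"))         # False (letter + fragment)
```

Under this rule a vocabulary can still contain all 256 single-byte base tokens, but no learned merge can ever produce a token that mixes complete characters with a partial UTF-8 sequence.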