🤖 AI Summary
Existing general-purpose tokenizers exhibit low segmentation efficiency, poor compression of specialized terminology, and weak robustness when post-processing OCR output for domain-specific text, particularly in the legal, financial, and governmental domains. To address these limitations, this work proposes a family of domain-specialized Byte-Pair Encoding (BPE) tokenizers tailored to these three domains, comprising two variants: case-sensitive BPE tokenizers and boundary-consistent character-level BPE tokenizers. The approach combines domain-adaptive training, explicit case modeling, character-level subword segmentation, and alignment of token boundaries between erroneous (e.g., OCR-corrupted) and ground-truth text. Experiments show that kl3m-004-128k-cased uses 9-17% fewer tokens than GPT-4o and Llama3 on domain-specific documents, and up to 83% fewer tokens for legal terminology and 39% fewer for financial terminology. The character-level tokenizers also make it easier for OCR error-correction models to learn and generalize correction patterns.
📝 Abstract
We present the KL3M tokenizers, a family of specialized tokenizers for legal, financial, and governmental text. Despite established work on tokenization, specialized tokenizers for professional domains remain understudied. Our paper offers two main contributions to this area. First, we introduce domain-specific BPE tokenizers for legal, financial, and governmental text. Our kl3m-004-128k-cased tokenizer uses 9-17% fewer tokens than GPT-4o and Llama3 for domain-specific documents, despite having a smaller vocabulary. For specialized terminology, our cased tokenizer is even more efficient, using up to 83% fewer tokens for legal terms and 39% fewer tokens for financial terms. Second, we develop character-level BPE tokenizers (4K, 8K, and 16K vocabulary sizes) for text correction tasks like OCR post-processing. These tokenizers keep consistent token boundaries between error-containing and correct text, making it easier for models to learn correction patterns. These tokenizers help professional applications by fitting more text in context windows, reducing computational needs, and preserving the meaning of domain-specific terms. Our analysis shows these efficiency gains directly benefit the processing of long legal and financial documents. We release all tokenizers and code through GitHub and Hugging Face to support further research in specialized tokenization.
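A minimal usage sketch of the released tokenizers follows, assuming they are published on Hugging Face under a namespace such as `alea-institute` and that the character-level variants follow a naming pattern like `kl3m-004-char-8k-cased`; both repository identifiers are assumptions for illustration, not confirmed by the abstract, so check the GitHub and Hugging Face releases for the actual names.

```python
# Sketch: loading the KL3M tokenizers and comparing their behavior.
# Repository paths below are assumed for illustration only.
from transformers import AutoTokenizer

# Domain-specific 128K cased BPE tokenizer (assumed repo path).
kl3m = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-128k-cased")

legal_text = "The indenture trustee shall provide notice of default to all holders of record."
print(len(kl3m.tokenize(legal_text)))  # token count for a domain-specific sentence

# Character-level tokenizer (assumed repo path) intended for OCR post-correction:
# because tokens stay near character granularity, an OCR-corrupted string and its
# corrected form should segment into closely aligned token boundaries.
kl3m_char = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-char-8k-cased")
print(kl3m_char.tokenize("Subsection (a)"))   # ground-truth text
print(kl3m_char.tokenize("Subsectlon (a)"))   # OCR-corrupted variant ("l" for "i")
```

In this sketch, the token-count comparison mirrors the efficiency claim for long legal and financial documents, and the paired character-level tokenizations illustrate the boundary consistency that the abstract describes for text-correction training.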