Bit-level BPE: Below the byte boundary

📅 2025-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the excessive sequence length and computational overhead that byte-level fallback BPE incurs on CJK and emoji-rich text, this paper proposes what it describes as the first bit-granularity lossless compression framework for tokenization. Methodologically, it extends BPE below the conventional byte and character boundaries to the bit level, introducing a reversible bit-string packing/unpacking mechanism and a language-agnostic lightweight compressor. Its core contribution is reversible subword reconstruction at the bit level, preserving the original tokenization semantics and full reversibility while significantly reducing sequence length. Experiments report an average 38% compression rate on CJK- and emoji-dense corpora, with corresponding reductions in training and inference latency and no loss in downstream task performance.

📝 Abstract
Byte-level fallbacks for subword tokenization have become a common practice in large language models. In particular, it has been demonstrated to be incredibly effective as a pragmatic solution for preventing OOV, especially in the context of larger models. However, breaking a character down to individual bytes significantly increases the sequence length for long-tail tokens in languages such as Chinese, Japanese, and Korean (CJK) and other character-diverse contexts such as emoji. The increased sequence length results in longer computation during both training and inference. In this work, we propose a simple compression technique that reduces the sequence length losslessly.
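The sequence-length blowup the abstract describes is easy to see directly: under byte-level fallback, each character contributes one token per UTF-8 byte, so CJK characters cost three tokens and most emoji cost four. A minimal illustration (ours, not from the paper):

```python
def byte_fallback_len(text: str) -> int:
    """Sequence length if every character falls back to raw UTF-8 bytes."""
    return len(text.encode("utf-8"))

print(byte_fallback_len("hello"))       # 5 chars -> 5 byte tokens
print(byte_fallback_len("こんにちは"))  # 5 chars -> 15 byte tokens (3 bytes each)
print(byte_fallback_len("👋🌍"))        # 2 chars -> 8 byte tokens (4 bytes each)
```

So a CJK sentence can be roughly three times longer in tokens than a Latin-script sentence of the same character count, which is the inefficiency the paper targets.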
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficiency in byte-level subword tokenization for CJK languages
Reduces sequence length caused by byte-level fallbacks without data loss
Improves computational speed during model training and inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bit-level BPE for subword tokenization
Lossless compression reduces sequence length
Effective for CJK and emoji characters
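To make the lossless-compression idea concrete, here is a toy sketch of reversible bit packing in the spirit of the paper's bit-string packing/unpacking mechanism. All names and the specific packing scheme are our assumptions, not the paper's: adjacent byte-fallback tokens are concatenated at the bit level into single 16-bit tokens (with a flag bit so lone bytes stay distinguishable), roughly halving sequence length while remaining exactly invertible.

```python
from typing import List

def pack_bits(byte_tokens: List[int]) -> List[int]:
    """Pack adjacent byte tokens (0-255) into 16-bit tokens.
    A flag bit (0x10000) marks packed pairs; a trailing odd byte
    is emitted as-is and stays below 256, so decoding is unambiguous."""
    packed, i = [], 0
    while i + 1 < len(byte_tokens):
        packed.append(0x10000 | (byte_tokens[i] << 8) | byte_tokens[i + 1])
        i += 2
    if i < len(byte_tokens):
        packed.append(byte_tokens[i])  # lone trailing byte
    return packed

def unpack_bits(tokens: List[int]) -> List[int]:
    """Inverse of pack_bits: recover the original byte-token sequence."""
    out = []
    for t in tokens:
        if t >= 0x10000:                       # flagged pair
            out.extend([(t >> 8) & 0xFF, t & 0xFF])
        else:                                  # lone byte
            out.append(t)
    return out

byte_tokens = list("こんにちは".encode("utf-8"))  # 15 byte tokens
packed = pack_bits(byte_tokens)                    # 8 tokens
assert unpack_bits(packed) == byte_tokens          # fully reversible
```

The round-trip assertion is the point: compression here changes the representation, not the information, which is why downstream tokenization semantics can be preserved.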