🤖 AI Summary
To address the inflated sequence lengths and computational overhead that byte-level fallback BPE incurs on CJK and emoji-rich text, this paper proposes a bit-granularity lossless compression framework for tokenization, reported as the first of its kind. Rather than stopping at the byte or character boundary, it extends BPE down to the bit level, introducing a reversible bit-string packing/unpacking mechanism together with a lightweight, language-agnostic compressor. The core contribution is reversible subword reconstruction at the bit level: sequence length is reduced substantially while the original tokenization semantics and full reversibility are preserved. Experiments report an average 38% compression rate on CJK- and emoji-dense corpora, with corresponding reductions in training and inference latency and no loss in downstream task performance.
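The packing/unpacking mechanism is only described at a high level above. As a concrete intuition, here is a minimal sketch, assuming byte-fallback tokens are concatenated into one bit string and re-chunked into wider fixed-width units that each map to a new token ID; the chunk width, padding scheme, and function names (`pack_bits`, `unpack_bits`) are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of reversible bit-level packing: byte-fallback tokens
# (values 0-255) are concatenated into a bit string, then re-chunked into
# wider fixed-width units. A 12-bit chunk width is an assumption here; the
# paper's mechanism and vocabulary layout may differ.

def pack_bits(byte_tokens: list[int], chunk_bits: int = 12) -> tuple[list[int], int]:
    """Pack byte-fallback tokens into chunk_bits-wide tokens.

    Returns the packed tokens and the number of padding bits appended,
    which is needed for exact reconstruction.
    """
    bits = "".join(f"{b:08b}" for b in byte_tokens)   # 8 bits per byte token
    pad = (-len(bits)) % chunk_bits                   # pad up to a chunk boundary
    bits += "0" * pad
    packed = [int(bits[i:i + chunk_bits], 2)
              for i in range(0, len(bits), chunk_bits)]
    return packed, pad

def unpack_bits(packed: list[int], pad: int, chunk_bits: int = 12) -> list[int]:
    """Invert pack_bits: recover the original byte-fallback token sequence."""
    bits = "".join(f"{t:0{chunk_bits}b}" for t in packed)
    if pad:
        bits = bits[:-pad]                            # drop the padding bits
    return [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]

# Round-trip check on the UTF-8 bytes of a CJK string plus an emoji:
text = "你好🙂"
byte_tokens = list(text.encode("utf-8"))              # 3 + 3 + 4 = 10 byte tokens
packed, pad = pack_bits(byte_tokens)                  # ceil(80 / 12) = 7 tokens
assert unpack_bits(packed, pad) == byte_tokens        # lossless by construction
assert bytes(unpack_bits(packed, pad)).decode("utf-8") == text
print(f"{len(byte_tokens)} byte tokens -> {len(packed)} packed tokens")
```

In this toy setting, 10 byte-fallback tokens shrink to 7 packed tokens at the cost of a 2^12-entry auxiliary vocabulary, and the padding count makes the mapping exactly invertible, which is the property the summary attributes to the paper's mechanism.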
📝 Abstract
Byte-level fallbacks for subword tokenization have become common practice in large language models. They have proven highly effective as a pragmatic way to prevent out-of-vocabulary (OOV) failures, especially in larger models. However, breaking a character down into individual bytes significantly increases sequence length for long-tail tokens in languages such as Chinese, Japanese, and Korean (CJK) and in other character-diverse contexts such as emoji, which in turn lengthens computation during both training and inference. In this work, we propose a simple compression technique that losslessly reduces this sequence length.
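To make the inflation concrete, the snippet below (an illustration, not the paper's code) counts the UTF-8 bytes a byte-level fallback would emit for strings that miss the learned subword vocabulary: each CJK character costs three byte tokens and each emoji four.

```python
# Illustration: byte-level fallback emits one token per UTF-8 byte for
# characters outside the subword vocabulary, so CJK text and emoji pay a
# 3-4x token penalty per character.
for s in ["你好", "こんにちは", "안녕", "🙂🎉"]:
    n_chars = len(s)
    n_bytes = len(s.encode("utf-8"))
    print(f"{s}: {n_chars} characters -> {n_bytes} byte-fallback tokens")
```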