🤖 AI Summary
To address the inflated sequence lengths and computational overhead that byte-level fallback BPE incurs on CJK and emoji-rich text, this paper proposes a bit-granularity lossless compression framework for tokenization, reported as the first of its kind. Rather than stopping at the byte or character boundary, it extends BPE down to the bit level, introducing a reversible bit-string packing/unpacking mechanism together with a lightweight, language-agnostic compressor. The core contribution is reversible subword reconstruction at the bit level: sequence length is reduced substantially while the original tokenization semantics and full reversibility are preserved. Experiments report an average 38% compression rate on CJK- and emoji-dense corpora, with corresponding reductions in training and inference latency and no loss in downstream task performance.
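The packing/unpacking mechanism is only described at a high level above. As a concrete intuition, here is a minimal sketch, assuming byte-fallback tokens are concatenated into one bit string and re-chunked into wider fixed-width units that each map to a new token ID; the chunk width, padding scheme, and function names (`pack_bits`, `unpack_bits`) are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of reversible bit-level packing: byte-fallback tokens
# (values 0-255) are concatenated into a bit string, then re-chunked into
# wider fixed-width units. A 12-bit chunk width is an assumption here; the
# paper's mechanism and vocabulary layout may differ.

def pack_bits(byte_tokens: list[int], chunk_bits: int = 12) -> tuple[list[int], int]:
    """Pack byte-fallback tokens into chunk_bits-wide tokens.

    Returns the packed tokens and the number of padding bits appended,
    which is needed for exact reconstruction.
    """
    bits = "".join(f"{b:08b}" for b in byte_tokens)   # 8 bits per byte token
    pad = (-len(bits)) % chunk_bits                   # pad up to a chunk boundary
    bits += "0" * pad
    packed = [int(bits[i:i + chunk_bits], 2)
              for i in range(0, len(bits), chunk_bits)]
    return packed, pad

def unpack_bits(packed: list[int], pad: int, chunk_bits: int = 12) -> list[int]:
    """Invert pack_bits: recover the original byte-fallback token sequence."""
    bits = "".join(f"{t:0{chunk_bits}b}" for t in packed)
    if pad:
        bits = bits[:-pad]                            # drop the padding bits
    return [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]

# Round-trip check on the UTF-8 bytes of a CJK string plus an emoji:
text = "你好🙂"
byte_tokens = list(text.encode("utf-8"))              # 3 + 3 + 4 = 10 byte tokens
packed, pad = pack_bits(byte_tokens)                  # ceil(80 / 12) = 7 tokens
assert unpack_bits(packed, pad) == byte_tokens        # lossless by construction
assert bytes(unpack_bits(packed, pad)).decode("utf-8") == text
print(f"{len(byte_tokens)} byte tokens -> {len(packed)} packed tokens")
```

In this toy setting, 10 byte-fallback tokens shrink to 7 packed tokens at the cost of a 2^12-entry auxiliary vocabulary, and the padding count makes the mapping exactly invertible, which is the property the summary attributes to the paper's mechanism.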
📝 Abstract
Byte-level fallbacks for subword tokenization have become common practice in large language models. They have proven highly effective as a pragmatic way to prevent out-of-vocabulary (OOV) failures, especially in larger models. However, breaking a character down into individual bytes significantly increases sequence length for long-tail tokens in languages such as Chinese, Japanese, and Korean (CJK) and in other character-diverse contexts such as emoji, which in turn lengthens computation during both training and inference. In this work, we propose a simple compression technique that losslessly reduces this sequence length.
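To make the inflation concrete, the snippet below (an illustration, not the paper's code) counts the UTF-8 bytes a byte-level fallback would emit for strings that miss the learned subword vocabulary: each CJK character costs three byte tokens and each emoji four.

```python
# Illustration: byte-level fallback emits one token per UTF-8 byte for
# characters outside the subword vocabulary, so CJK text and emoji pay a
# 3-4x token penalty per character.
for s in ["你好", "こんにちは", "안녕", "🙂🎉"]:
    n_chars = len(s)
    n_bytes = len(s.encode("utf-8"))
    print(f"{s}: {n_chars} characters -> {n_bytes} byte-fallback tokens")
```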