🤖 AI Summary
This work addresses the challenge of simultaneously achieving morphological alignment and compression efficiency in subword tokenization. We propose an information-driven method for constructing a fixed subword vocabulary. Its core innovation is the first use of a byte-level language model's prediction error as a proxy for information gain when identifying subword boundaries: contiguous, highly predictable byte sequences (i.e., runs of low prediction error) are grouped into semantically coherent subword units, with boundaries placed where prediction error spikes. Unlike conventional frequency-based approaches, our method jointly models morphological structure and statistical regularity. Experiments demonstrate significant improvements in morphological alignment scores on English. Moreover, on a 25-language multilingual benchmark, our method achieves compression ratios and Rényi entropy efficiency comparable to Byte Pair Encoding (BPE), confirming its cross-lingual robustness and practical utility.
📝 Abstract
Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an autoregressive model's prediction error. Inspired by this connection, we explore whether grouping predictable bytes - rather than pooling their representations - can yield a useful fixed subword vocabulary. We propose a new information-driven subword tokeniser, ByteSpan, that uses an external byte-level LM during training to identify contiguous predictable byte sequences and group them into subwords. Experiments show that ByteSpan yields efficient vocabularies with higher morphological alignment scores than BPE for English. Multilingual experiments show similar compression and Rényi efficiency for 25 languages.
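The boundary criterion described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the byte-level LM is stood in for by a toy per-byte surprisal array, and the function name `segment_by_spikes` and the fixed threshold are assumptions for the example.

```python
def segment_by_spikes(data: bytes, surprisal: list[float], threshold: float) -> list[bytes]:
    """Start a new subword wherever the LM's prediction error (surprisal)
    exceeds `threshold`; predictable bytes stay grouped with their predecessor."""
    assert len(data) == len(surprisal)
    subwords, start = [], 0
    for i in range(1, len(data)):
        if surprisal[i] > threshold:  # spike in prediction error => candidate boundary
            subwords.append(data[start:i])
            start = i
    subwords.append(data[start:])
    return subwords

# Toy surprisals: high at the start of each word, low inside it.
text = b"thecat"
surprisal = [3.1, 0.4, 0.2, 2.8, 0.3, 0.1]
print(segment_by_spikes(text, surprisal, threshold=1.0))  # [b'the', b'cat']
```

In ByteSpan the surprisal values would come from the external byte-level LM used during training, and the resulting predictable spans are what get collected into the fixed vocabulary.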