ByteSpan: Information-Driven Subword Tokenisation

📅 2025-06-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of simultaneously achieving morphological alignment and compression efficiency in subword tokenisation. The authors propose an information-driven method for constructing a fixed vocabulary. Its core innovation is the first use of prediction error from a byte-level language model as a proxy for information content when identifying subword boundaries: spikes in prediction error mark boundaries, while the contiguous, highly predictable (low-information) byte sequences between them are grouped into semantically coherent subword units. Unlike conventional frequency-based approaches, the method jointly captures morphological structure and statistical regularities. Experiments demonstrate significant improvements in morphological alignment scores on English. Moreover, on a 25-language multilingual benchmark, the method achieves compression ratios and Rényi entropy efficiency comparable to Byte Pair Encoding (BPE), supporting its cross-lingual robustness and practical utility.

📝 Abstract
Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an autoregressive model's prediction error. Inspired by this connection, we explore whether grouping predictable bytes - rather than pooling their representations - can yield a useful fixed subword vocabulary. We propose a new information-driven subword tokeniser, ByteSpan, that uses an external byte-level LM during training to identify contiguous predictable byte sequences and group them into subwords. Experiments show that ByteSpan yields efficient vocabularies with higher morphological alignment scores than BPE for English. Multilingual experiments show similar compression and Rényi efficiency for 25 languages.
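The grouping idea described above can be illustrated with a minimal sketch. This is not the authors' exact algorithm: the `segment` function, the toy surprisal values, and the fixed threshold are all illustrative assumptions, and the external byte-level LM is replaced by precomputed per-byte surprisals.

```python
def segment(byte_seq, surprisals, threshold=2.0):
    """Group contiguous predictable bytes into candidate subwords.

    byte_seq   -- sequence of byte values (characters here, for readability)
    surprisals -- per-byte surprisal (-log p) from an external byte-level LM;
                  supplied directly, since the LM itself is out of scope
    threshold  -- surprisal level above which a new subword begins
                  (a simplifying assumption; ByteSpan's actual criterion
                  may differ)
    """
    subwords, current = [], [byte_seq[0]]
    for b, s in zip(byte_seq[1:], surprisals[1:]):
        if s > threshold:          # prediction-error spike: start a boundary
            subwords.append(current)
            current = [b]
        else:                      # predictable byte: extend current subword
            current.append(b)
    subwords.append(current)
    return ["".join(w) for w in subwords]

# Toy example with hand-made surprisal values: spikes at word onset and at
# the suffix boundary split "playing" into "play" + "ing".
print(segment(list("playing"), [3.1, 0.4, 0.6, 0.5, 2.8, 0.3, 0.2]))
# → ['play', 'ing']
```

In a real run, the surprisals would come from an autoregressive byte-level LM, and the resulting spans would be counted across a corpus to build the fixed subword vocabulary.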
Problem

Research questions and friction points this paper is trying to address.

Develops ByteSpan for information-driven subword tokenisation
Groups predictable bytes into subwords using byte-level LM
Improves morphological alignment and efficiency over BPE
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses byte-level LM for subword grouping
Identifies predictable byte sequences
Creates efficient, morphologically aligned vocabularies
Zébulon Goriely
PhD Student, University of Cambridge
child language acquisition, language models
Suchir Salhan
University of Cambridge
Machine Learning, Language Models, Natural Language Processing, Linguistics, Cognitive Science
Pietro Lesci
University of Cambridge
Interpretability, Causality, Memorisation, Tokenisation, Active Learning
Julius Cheng
Department of Computer Science and Technology, University of Cambridge, U.K.
Paula Buttery
Department of Computer Science and Technology, University of Cambridge, U.K.; ALTA Institute, University of Cambridge, U.K.