🤖 AI Summary
This work addresses the challenge of simultaneously achieving morphological alignment and compression efficiency in subword tokenization. We propose an information-driven method for constructing a fixed subword vocabulary. Its core innovation is the first use of a byte-level language model's prediction error as a proxy for information gain when identifying subword boundaries: contiguous, highly predictable byte sequences (i.e., runs of low prediction error) are grouped into semantically coherent subword units, with boundaries placed where prediction error spikes. Unlike conventional frequency-based approaches, our method jointly models morphological structure and statistical regularity. Experiments demonstrate significant improvements in morphological alignment scores on English. Moreover, on a 25-language multilingual benchmark, our method achieves compression ratios and Rényi entropy efficiency comparable to Byte Pair Encoding (BPE), confirming its cross-lingual robustness and practical utility.
📝 Abstract
Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an autoregressive model's prediction error. Inspired by this connection, we explore whether grouping predictable bytes - rather than pooling their representations - can yield a useful fixed subword vocabulary. We propose a new information-driven subword tokeniser, ByteSpan, that uses an external byte-level LM during training to identify contiguous predictable byte sequences and group them into subwords. Experiments show that ByteSpan yields efficient vocabularies with higher morphological alignment scores than BPE for English. Multilingual experiments show similar compression and Rényi efficiency for 25 languages.
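The boundary criterion described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the byte-level LM is stood in for by a toy per-byte surprisal array, and the function name `segment_by_spikes` and the fixed threshold are assumptions for the example.

```python
def segment_by_spikes(data: bytes, surprisal: list[float], threshold: float) -> list[bytes]:
    """Start a new subword wherever the LM's prediction error (surprisal)
    exceeds `threshold`; predictable bytes stay grouped with their predecessor."""
    assert len(data) == len(surprisal)
    subwords, start = [], 0
    for i in range(1, len(data)):
        if surprisal[i] > threshold:  # spike in prediction error => candidate boundary
            subwords.append(data[start:i])
            start = i
    subwords.append(data[start:])
    return subwords

# Toy surprisals: high at the start of each word, low inside it.
text = b"thecat"
surprisal = [3.1, 0.4, 0.2, 2.8, 0.3, 0.1]
print(segment_by_spikes(text, surprisal, threshold=1.0))  # [b'the', b'cat']
```

In ByteSpan the surprisal values would come from the external byte-level LM used during training, and the resulting predictable spans are what get collected into the fixed vocabulary.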