🤖 AI Summary
Byte-level language models such as ByT5 avoid tokenization but produce excessively long sequences, making training and inference inefficient; subword models, conversely, are sensitive to character-level noise and compress unevenly across languages and scripts. MrT5 addresses this by adding a dynamic token merging mechanism to the ByT5 encoder: a learnable deletion gate adaptively prunes redundant byte tokens based on contextual cues while merging salient information into the retained tokens. With multilingual training, MrT5 learns language-adaptive compression rates in a purely byte-level model. It reduces sequence lengths by up to 75%, remains competitive with ByT5 on downstream multilingual tasks (XNLI, TyDi QA), shows minimal degradation in bits-per-byte, and significantly accelerates inference. The approach combines a learnable deletion gate, context-aware token merging, multilingual continued pre-training, and end-to-end byte-level modeling.
📝 Abstract
Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages and scripts. While character- or byte-level models like ByT5 attempt to address these concerns, they have not gained widespread adoption -- processing raw byte streams without tokenization results in significantly longer sequence lengths, making training and inference inefficient. This work introduces MrT5 (MergeT5), a more efficient variant of ByT5 that integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length. After processing through a fixed number of encoder layers, a learned delete gate determines which tokens are to be removed and which are to be retained for subsequent layers. MrT5 effectively "merges" critical information from deleted tokens into a more compact sequence, leveraging contextual information from the remaining tokens. In continued pre-training experiments, we find that MrT5 can achieve significant gains in inference runtime with minimal effect on performance, as measured by bits-per-byte. Additionally, with multilingual training, MrT5 adapts to the orthographic characteristics of each language, learning language-specific compression rates. Furthermore, MrT5 shows comparable accuracy to ByT5 on downstream evaluations such as XNLI, TyDi QA, and character-level tasks while reducing sequence lengths by up to 75%. Our approach presents a solution to the practical limitations of existing byte-level models.
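The delete-gate idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the gate weights, threshold, and toy hidden states below are all hypothetical, and the sketch shows only the core mechanic -- a learned linear gate scores each token after some number of encoder layers, and tokens scoring below a threshold are dropped so that later layers see a shorter sequence.

```python
import math

def delete_gate(hidden_states, weights, bias, threshold=0.5):
    """Hypothetical MrT5-style deletion gate (illustrative only).

    Each token's hidden state is scored with a learned linear
    projection followed by a sigmoid; tokens whose gate score
    falls below `threshold` are deleted, and only the retained
    tokens would be passed to subsequent encoder layers.
    """
    kept = []
    for h in hidden_states:
        logit = sum(w * x for w, x in zip(weights, h)) + bias
        score = 1.0 / (1.0 + math.exp(-logit))  # sigmoid gate
        if score >= threshold:
            kept.append(h)
    return kept

# Toy example: four "byte tokens" with 2-dim hidden states.
states = [[1.0, 0.5], [-2.0, 0.1], [0.3, 0.9], [-1.5, -0.5]]
w, b = [1.0, 1.0], 0.0  # made-up gate parameters
shortened = delete_gate(states, w, b)  # two of four tokens survive
```

In the actual model the surviving tokens carry contextual information absorbed from their deleted neighbors via earlier attention layers, which is what lets deletion act as a "merge" rather than pure information loss.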