🤖 AI Summary
Byte-level language models such as ByT5 avoid tokenization but produce excessively long sequences, making training and inference inefficient; subword models, conversely, are sensitive to character-level noise and compress unevenly across languages and scripts. MrT5 addresses this by adding a dynamic token merging mechanism to the ByT5 encoder: a learnable deletion gate adaptively prunes redundant byte tokens based on contextual cues while merging salient information into the retained tokens. With multilingual training, MrT5 learns language-adaptive compression rates in a purely byte-level model. It reduces sequence lengths by up to 75%, remains competitive with ByT5 on downstream multilingual tasks (XNLI, TyDi QA), shows minimal degradation in bits-per-byte, and significantly accelerates inference. The approach combines a learnable deletion gate, context-aware token merging, multilingual continued pre-training, and end-to-end byte-level modeling.
📝 Abstract
Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages and scripts. While character- or byte-level models like ByT5 attempt to address these concerns, they have not gained widespread adoption -- processing raw byte streams without tokenization results in significantly longer sequence lengths, making training and inference inefficient. This work introduces MrT5 (MergeT5), a more efficient variant of ByT5 that integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length. After processing through a fixed number of encoder layers, a learned delete gate determines which tokens are to be removed and which are to be retained for subsequent layers. MrT5 effectively "merges" critical information from deleted tokens into a more compact sequence, leveraging contextual information from the remaining tokens. In continued pre-training experiments, we find that MrT5 can achieve significant gains in inference runtime with minimal effect on performance, as measured by bits-per-byte. Additionally, with multilingual training, MrT5 adapts to the orthographic characteristics of each language, learning language-specific compression rates. Furthermore, MrT5 shows comparable accuracy to ByT5 on downstream evaluations such as XNLI, TyDi QA, and character-level tasks while reducing sequence lengths by up to 75%. Our approach presents a solution to the practical limitations of existing byte-level models.
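The delete-gate idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the gate weights, threshold, and toy hidden states below are all hypothetical, and the sketch shows only the core mechanic -- a learned linear gate scores each token after some number of encoder layers, and tokens scoring below a threshold are dropped so that later layers see a shorter sequence.

```python
import math

def delete_gate(hidden_states, weights, bias, threshold=0.5):
    """Hypothetical MrT5-style deletion gate (illustrative only).

    Each token's hidden state is scored with a learned linear
    projection followed by a sigmoid; tokens whose gate score
    falls below `threshold` are deleted, and only the retained
    tokens would be passed to subsequent encoder layers.
    """
    kept = []
    for h in hidden_states:
        logit = sum(w * x for w, x in zip(weights, h)) + bias
        score = 1.0 / (1.0 + math.exp(-logit))  # sigmoid gate
        if score >= threshold:
            kept.append(h)
    return kept

# Toy example: four "byte tokens" with 2-dim hidden states.
states = [[1.0, 0.5], [-2.0, 0.1], [0.3, 0.9], [-1.5, -0.5]]
w, b = [1.0, 1.0], 0.0  # made-up gate parameters
shortened = delete_gate(states, w, b)  # two of four tokens survive
```

In the actual model the surviving tokens carry contextual information absorbed from their deleted neighbors via earlier attention layers, which is what lets deletion act as a "merge" rather than pure information loss.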