🤖 AI Summary
Existing tokenizer transfer methods for large language models (LLMs) struggle to adapt pretrained tokenizers to low-resource languages or scripts, as they typically rely on semantic heuristics for embedding initialization while neglecting higher-level model dynamics, leading to suboptimal adaptation. This work proposes Model-Aware Tokenizer Transfer (MATT), which introduces an Attention Influence Modeling (AIM) objective that uses inter-token attention patterns from the source model as supervisory signals to guide both the initialization and the optimization of target-language embeddings. MATT distills this attention behavior in a lightweight warm-up phase before standard language modeling, enabling efficient tokenizer adaptation. Experiments show that MATT restores over 90% of the original model's performance using only a few GPU hours, significantly outperforming mainstream baselines across diverse low-resource and cross-script languages. According to the authors, this is the first tokenizer transfer approach driven explicitly by internal model dynamics.
📝 Abstract
Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer methods typically rely on semantic heuristics to initialize new embeddings, ignoring higher-layer model dynamics and limiting transfer quality. We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model with a new tokenizer, providing an efficient warm-up before standard language modeling. Unlike approaches that focus solely on embedding similarity, MATT leverages attention behavior to guide embedding initialization and adaptation. Experiments across diverse linguistic settings show that MATT recovers a large fraction of the original model's performance within a few GPU hours, outperforming heuristic baselines. These results demonstrate that incorporating model-level signals offers a practical and effective path toward robust tokenizer transfer in multilingual LLMs.
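The AIM objective described above distills inter-token attention patterns from the source model into the target model. The paper's exact formulation is not given here, but a common way to realize attention-behavior distillation is a KL divergence between the two models' attention distributions. Below is a minimal illustrative sketch under that assumption; the function name `aim_loss` and the premise that the source and target tokenizations are aligned to the same sequence length are simplifications for illustration, not the paper's actual implementation.

```python
import numpy as np

def aim_loss(source_attn, target_attn, eps=1e-9):
    """Illustrative attention-distillation loss (assumption, not the
    paper's exact objective): mean KL divergence between source and
    target attention distributions, per head and query position.

    source_attn, target_attn: arrays of shape (heads, seq, seq),
    where each row along the last axis is an attention distribution
    (non-negative, sums to 1). This sketch assumes both models
    produce attention maps over the same sequence length.
    """
    # Clip to avoid log(0); real implementations typically work in
    # log-space directly from the attention logits.
    p = np.clip(source_attn, eps, 1.0)
    q = np.clip(target_attn, eps, 1.0)
    # KL(p || q) summed over keys, then averaged over heads/queries.
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(kl.mean())

# Identical attention maps give zero loss; a shifted target gives a
# positive loss that the warm-up phase would minimize.
uniform = np.full((2, 4, 4), 0.25)
shifted = np.full((2, 4, 4), 0.1)
shifted[:, :, 0] = 0.7
print(aim_loss(uniform, uniform))  # → 0.0
print(aim_loss(uniform, shifted) > 0.0)  # → True
```

In practice a warm-up of this kind would be added to (or precede) the standard language-modeling loss, updating only the new embeddings while the rest of the target model stays close to the source.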