🤖 AI Summary
This paper addresses vocabulary mismatch and performance degradation in cross-tokenizer transfer of pretrained large language models (LLMs). The authors propose a training-free, zero-gradient tokenizer transplantation method that uses shared anchor tokens as a basis and employs orthogonal matching pursuit (OMP) to sparsely reconstruct each new token embedding as a linear combination of those anchors, enabling unsupervised alignment in the embedding space. The work establishes, for the first time, a "zero-training, zero-gradient" paradigm for tokenizer migration; reveals the critical impact of mismatched numerical tokenization on mathematical reasoning capabilities; and supports plug-and-play, post-hoc vocabulary recalibration. Evaluated on zero-shot cross-tokenizer tasks, including Llama→Mistral and Qwen→Llama, the method significantly outperforms baselines such as zero-initialized embedding replacement and WECHSEL, achieving state-of-the-art performance. The approach has been integrated into the open-source toolkit mergekit-tokensurgeon.
📝 Abstract
We present a training-free method to transplant tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token as a sparse linear combination of shared tokens, in two phases: first, compute each new token's representation in the donor embedding space using a small dictionary of shared anchor tokens; then, transfer the same sparse coefficients back into the base model's embedding space. On two challenging cross-tokenizer tasks, Llama→Mistral NeMo (12B) and Qwen→Llama (1B), we show that OMP achieves the best zero-shot preservation of the base model's performance across multiple benchmarks, while other zero-shot approaches degrade significantly. Compared to baselines (zero-init, mean-init, and existing approaches such as WECHSEL, FOCUS, and ZETT), OMP consistently achieves the best overall performance, effectively bridging large tokenizer discrepancies without gradient updates. Our analysis further identifies mismatched numerical tokenization schemes as a critical challenge for preserving mathematical reasoning capabilities. This technique enables direct reuse of pretrained model weights with new tokenizers, facilitating cross-tokenizer knowledge distillation, speculative decoding, ensembling, merging, and domain-specific vocabulary adaptation. We integrate our method into the open-source mergekit-tokensurgeon tool for post hoc vocabulary realignment.
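To make the two-phase procedure concrete, below is a minimal illustrative sketch in plain NumPy, not the actual mergekit-tokensurgeon implementation. The function names, the dictionary size `k`, and the assumption that the shared anchor embeddings are available as row-aligned arrays in both models are hypothetical choices for the example.

```python
import numpy as np

def omp_coefficients(y, D, k):
    """Greedy OMP: select k atoms (rows of D) and return (indices, coeffs)
    such that y ~= coeffs @ D[indices]."""
    # Normalize atoms for the correlation step only.
    D_norm = D / np.linalg.norm(D, axis=1, keepdims=True)
    y = y.astype(np.float64)
    residual = y.copy()
    selected, coeffs = [], np.empty(0)
    for _ in range(k):
        scores = np.abs(D_norm @ residual)   # correlation with current residual
        scores[selected] = -np.inf           # never reselect an atom
        selected.append(int(np.argmax(scores)))
        A = D[selected]                      # (num_selected, dim) chosen atoms
        # Jointly re-fit all chosen atoms by least squares (the "orthogonal" step).
        coeffs, *_ = np.linalg.lstsq(A.T, y, rcond=None)
        residual = y - coeffs @ A
    return selected, coeffs

def transplant_embedding(donor_emb_new_tok, donor_anchors, base_anchors, k=8):
    """Phase 1: solve a sparse code for the new token over shared anchor tokens
    in the donor space.  Phase 2: reuse that code over the base model's anchors."""
    idx, c = omp_coefficients(donor_emb_new_tok, donor_anchors, k)
    return c @ base_anchors[idx]

# Toy usage with random data standing in for real embedding tables.
rng = np.random.default_rng(0)
donor_anchors = rng.normal(size=(1000, 64))    # shared tokens, donor space
base_anchors  = rng.normal(size=(1000, 128))   # same tokens, base space
new_token     = rng.normal(size=64)            # token present only in the donor vocab
base_embedding = transplant_embedding(new_token, donor_anchors, base_anchors, k=8)
```

The key property this sketch illustrates is that the sparse coefficients are solved only once, in the donor space, and then applied verbatim to the base model's anchor embeddings, so no gradient updates or training data are required.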