Cross-Tokenizer LLM Distillation through a Byte-Level Interface

πŸ“… 2026-04-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenge of knowledge distillation between teacher and student large language models that use incompatible tokenizers. To overcome this issue, the authors propose a byte-level cross-tokenizer distillation method that translates the teacher model's output into a byte-level probability distribution and aligns it with a lightweight byte-level decoder head integrated into the student model. This approach eliminates the need for complex vocabulary mapping or alignment strategies by leveraging bytes as a universal intermediate representation, thereby enabling effective knowledge transfer across disparate tokenization schemes. Experimental results demonstrate that the proposed method achieves strong performance across multiple benchmark tasks and scales consistently from 1B to 8B parameter models, with certain metrics surpassing those of existing, more intricate distillation techniques.
πŸ“ Abstract
Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with--and on several benchmarks surpasses--significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
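The abstract's core idea, converting the teacher's next-token distribution into a next-byte distribution and distilling through that shared interface, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the first-byte marginalization, and the KL loss are illustrative assumptions about how such a byte-level interface could work.

```python
# Hypothetical sketch of a byte-level distillation interface.
# Assumption: we marginalize the teacher's next-token probabilities onto the
# first byte of each token, yielding a distribution over the 256 byte values,
# and distill with a KL divergence against the student's byte-level head.
import math

def next_byte_distribution(token_probs, vocab):
    """token_probs: dict token_id -> probability; vocab: token_id -> bytes."""
    byte_probs = [0.0] * 256
    for tid, p in token_probs.items():
        token_bytes = vocab[tid]
        if token_bytes:  # attribute the token's mass to its leading byte
            byte_probs[token_bytes[0]] += p
    total = sum(byte_probs)
    return [p / total for p in byte_probs] if total else byte_probs

def byte_kl_loss(teacher_bytes, student_bytes, eps=1e-12):
    # KL(teacher || student) over the shared 256-way byte distribution
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(teacher_bytes, student_bytes))
```

For example, with a toy vocabulary `{0: b"he", 1: b"hi", 2: b"a"}` and teacher probabilities `{0: 0.5, 1: 0.3, 2: 0.2}`, the byte `h` receives mass 0.8 and `a` receives 0.2, regardless of how either model's tokenizer segments the text, which is what makes bytes a tokenizer-agnostic common ground.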
Problem

Research questions and friction points this paper is trying to address.

Cross-tokenizer distillation
Language model
Knowledge distillation
Tokenizer mismatch
Byte-level interface
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Tokenizer Distillation
Byte-Level Distillation
Language Model Compression
Knowledge Distillation
Tokenizer-Agnostic Interface
Avyav Kumar Singh
King's College London, London (United Kingdom)
Yen-Chen Wu
MediaTek Research, Cambridge (United Kingdom)
Alexandru Cioba
Orbital Materials, London (United Kingdom)
Alberto Bernacchia
Director of AI Research at MediaTek Research UK
Machine Learning and Computational Neuroscience
Davide Buffelli
AI Research Scientist at MediaTek Research
Deep Learning, Machine Learning, Artificial Intelligence