🤖 AI Summary
Autoregressive language models use heterogeneous tokenization schemes, so their vocabularies are mutually incompatible; this hinders cross-model collaboration at the level of next-token distributions, e.g., ensembling.
Method: We propose the first lossless vocabulary reduction framework, which exactly reconstructs the next-token probability distribution over any given vocabulary subset via probability-distribution reconstruction and prefix-tree pruning.
Contribution/Results: Theoretically, we establish the first complete formal framework for lossless vocabulary reduction. Practically, our method enables efficient collaborative generation across models with disparate tokenizers by leveraging their maximal common vocabulary. Experiments demonstrate that performance is preserved even under extreme reduction (e.g., vocabularies of ~100 tokens), while cross-model interoperability and ensemble efficacy improve, without any accuracy degradation.
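To make the core idea concrete, here is a toy sketch (not the paper's algorithm) of reconstructing a next-token distribution over a reduced vocabulary: the probability that generation continues with a given short token is obtained by marginalizing over all original tokens, recursing when an original token covers only a prefix of the target string. The model, vocabulary, and function names below are all hypothetical.

```python
def prob_text_starts_with(s, context, next_token_dist):
    """Probability that text generated after `context` begins with string `s`.

    `next_token_dist(context)` returns a dict {token: probability} over the
    original vocabulary (a hypothetical stand-in for a real model).
    """
    if not s:
        return 1.0
    total = 0.0
    for tok, p in next_token_dist(context).items():
        if tok.startswith(s):
            # The whole target string fits inside this single original token.
            total += p
        elif s.startswith(tok):
            # The token covers only a prefix of s; recurse on the remainder.
            total += p * prob_text_starts_with(
                s[len(tok):], context + tok, next_token_dist
            )
    return total

# Toy context-independent model: uniform over a tiny original vocabulary.
VOCAB = ["a", "ab", "b", "ba"]
def toy_model(context):
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

# Exact next-token distribution over the reduced vocabulary {"a", "b"}:
reduced = {t: prob_text_starts_with(t, "", toy_model) for t in ["a", "b"]}
```

Here `reduced` assigns probability 0.5 to each of `"a"` and `"b"`, since half of the original tokens begin with each letter. A practical implementation would of course need the prefix-tree pruning the summary mentions to avoid enumerating the full vocabulary at every step.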
📝 Abstract
Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. In particular, auto-regressive language models generate text token by token, i.e., by predicting the next-token distribution given the previous tokens, so tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as its set of possible tokens, models struggle to cooperate with each other at the level of next-token distributions, e.g., in model ensembling. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenizations can cooperate with each other efficiently through their maximal common vocabulary.