Lossless Vocabulary Reduction for Auto-Regressive Language Models

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive language models use heterogeneous tokenization schemes, so their vocabularies are incompatible and cross-model collaboration (e.g., ensembling) is difficult. Method: The paper proposes a lossless vocabulary reduction framework that reconstructs the next-token probability distribution exactly over any given vocabulary subset, via probabilistic distribution reconstruction and prefix-tree pruning. Contribution/Results: Theoretically, it establishes the first complete formal framework for lossless vocabulary reduction. Practically, the method enables efficient collaborative generation across models with disparate tokenizers through their maximal common vocabulary. Experiments show that performance is preserved even under extreme reduction (e.g., vocabularies of ~100 tokens), while cross-model interoperability and ensemble efficacy improve, without any accuracy degradation.

📝 Abstract
Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. In particular, auto-regressive language models generate text token by token, i.e., by predicting the next-token distribution given the previous tokens, so tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as its set of possible tokens, models struggle to cooperate with each other at the level of next-token distributions, e.g., in model ensembles. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.
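As a rough illustration of reducing a next-token distribution to a vocabulary subset: group the probability mass of each full-vocabulary token under the subset token that begins its decomposition. This is a hedged one-step sketch (`decompose` and `reduce_distribution` are hypothetical names), not the paper's exact prefix-tree algorithm, which handles continuations across multiple steps losslessly.

```python
# Simplified sketch: one marginalization step of vocabulary reduction.
# NOT the paper's exact procedure; names and the greedy decomposition
# rule are illustrative assumptions.

def decompose(token: str, sub_vocab: set[str]) -> list[str]:
    """Greedy longest-match decomposition of a token into subset tokens."""
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in sub_vocab:
                pieces.append(token[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot decompose {token!r} over the subset")
    return pieces

def reduce_distribution(p_full: dict[str, float],
                        sub_vocab: set[str]) -> dict[str, float]:
    """Mass of each subset token = total probability of full-vocabulary
    tokens whose decomposition starts with that subset token."""
    p_sub = {t: 0.0 for t in sorted(sub_vocab)}
    for token, p in p_full.items():
        p_sub[decompose(token, sub_vocab)[0]] += p
    return p_sub

# Toy example: the full vocabulary has a merged token "ab"; the subset
# keeps only the single characters, so "ab"'s mass folds into "a".
p_full = {"ab": 0.5, "a": 0.2, "b": 0.3}
print(reduce_distribution(p_full, {"a", "b"}))
```

A lossless scheme must also account for the remaining pieces of a decomposed token (here, the "b" following "a" inside "ab") when predicting subsequent steps; the prefix-tree pruning in the paper is what makes that exact.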
Problem

Research questions and friction points this paper is trying to address.

Enables auto-regressive language models with different vocabularies to cooperate efficiently
Reduces vocabulary size of language models without losing prediction accuracy
Solves tokenization incompatibility issues in next-token distribution model ensembles
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lossless vocabulary reduction for language models
Converts models to smaller vocabularies without accuracy loss
Enables cooperation between models with different tokenizations
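The cooperation idea above can be sketched as follows. This is a minimal illustration with hypothetical helper names: it assumes each model already exposes a next-token distribution reduced to the common vocabulary, and uses a uniform average in place of whatever ensembling rule the paper actually evaluates.

```python
# Hedged sketch: ensembling two models through their common vocabulary.
# Assumes distributions are already (losslessly) reduced to that vocabulary.

def common_vocab(vocab_a: set[str], vocab_b: set[str]) -> set[str]:
    """Maximal common vocabulary: tokens present in both tokenizers."""
    return vocab_a & vocab_b

def ensemble(p_a: dict[str, float], p_b: dict[str, float]) -> dict[str, float]:
    """Uniform average of two next-token distributions over the same vocabulary."""
    assert p_a.keys() == p_b.keys(), "distributions must share a vocabulary"
    return {t: 0.5 * (p_a[t] + p_b[t]) for t in p_a}

vocab = common_vocab({"a", "b", "ab"}, {"a", "b", "ba"})
p = ensemble({"a": 0.6, "b": 0.4}, {"a": 0.2, "b": 0.8})
print(sorted(vocab), p)
```

Because both distributions live on the same (common) vocabulary, the average is itself a valid distribution, which is exactly what tokenizer mismatch otherwise prevents.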
Daiki Chijiwa
NTT
Taku Hasegawa
NTT Human Informatics Laboratories, NTT Corporation
Kyosuke Nishida
NTT Human Informatics Laboratories, NTT Corporation
natural language processing · vision and language · artificial intelligence · data mining
Shin'ya Yamaguchi
NTT, Kyoto University
Dataset Synthesis · Generative Models · Representation Learning · Vision-Language Models
Tomoya Ohba
NTT Computer and Data Science Laboratories, NTT Corporation
Tamao Sakao
NTT Computer and Data Science Laboratories, NTT Corporation
Susumu Takeuchi
NTT Computer and Data Science Laboratories, NTT Corporation