🤖 AI Summary
Autoregressive language models use heterogeneous tokenization schemes, so their vocabularies are mutually incompatible; this hinders cross-model collaboration at the level of next-token distributions, e.g., ensembling.
Method: We propose the first lossless vocabulary reduction framework, which exactly reconstructs the next-token probability distribution over any given vocabulary subset via probability-distribution reconstruction and prefix-tree pruning.
Contribution/Results: Theoretically, we establish the first complete formal framework for lossless vocabulary reduction. Practically, our method enables efficient collaborative generation across models with disparate tokenizers by leveraging their maximal common vocabulary. Experiments demonstrate that performance is preserved even under extreme reduction (e.g., vocabularies of ~100 tokens), while cross-model interoperability and ensemble efficacy improve, without any accuracy degradation.
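To make the core idea concrete, here is a toy sketch (not the paper's algorithm) of reconstructing a next-token distribution over a reduced vocabulary: the probability that generation continues with a given short token is obtained by marginalizing over all original tokens, recursing when an original token covers only a prefix of the target string. The model, vocabulary, and function names below are all hypothetical.

```python
def prob_text_starts_with(s, context, next_token_dist):
    """Probability that text generated after `context` begins with string `s`.

    `next_token_dist(context)` returns a dict {token: probability} over the
    original vocabulary (a hypothetical stand-in for a real model).
    """
    if not s:
        return 1.0
    total = 0.0
    for tok, p in next_token_dist(context).items():
        if tok.startswith(s):
            # The whole target string fits inside this single original token.
            total += p
        elif s.startswith(tok):
            # The token covers only a prefix of s; recurse on the remainder.
            total += p * prob_text_starts_with(
                s[len(tok):], context + tok, next_token_dist
            )
    return total

# Toy context-independent model: uniform over a tiny original vocabulary.
VOCAB = ["a", "ab", "b", "ba"]
def toy_model(context):
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

# Exact next-token distribution over the reduced vocabulary {"a", "b"}:
reduced = {t: prob_text_starts_with(t, "", toy_model) for t in ["a", "b"]}
```

Here `reduced` assigns probability 0.5 to each of `"a"` and `"b"`, since half of the original tokens begin with each letter. A practical implementation would of course need the prefix-tree pruning the summary mentions to avoid enumerating the full vocabulary at every step.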
📝 Abstract
Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. In particular, auto-regressive language models generate text token by token, i.e., by predicting the next-token distribution given the previous tokens, so tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as its set of possible tokens, models struggle to cooperate with each other at the level of next-token distributions, e.g., in model ensembling. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenizations can cooperate with each other efficiently through their maximal common vocabulary.