Token-level Ensembling of Models with Different Vocabularies

📅 2025-02-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the compatibility bottleneck that inconsistent tokenizer vocabularies create for token-level ensembling of multiple models. We propose a training-free, model-agnostic, inference-time alignment algorithm that requires no modification to the base models. Grounded in subword mapping and probability reweighting, the method enables token-level collaboration between encoder-decoder (e.g., mBART) and decoder-only (e.g., Llama) architectures during generation, ensuring surface-form consistency of outputs. To our knowledge, this is the first vocabulary-agnostic, zero-parameter, zero-fine-tuning ensemble paradigm fully compatible with Hugging Face's standard interfaces. Evaluated on machine translation, ensembling 12 heterogeneous model pairs yields an average BLEU improvement of +1.8 over the individual models, significantly outperforming single-model baselines and removing the long-standing constraint that token-level ensembles share a vocabulary.
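The core idea described above can be sketched in a few lines. This is a hedged illustration, not the paper's actual algorithm: two toy models with different subword vocabularies each propose next tokens, and we keep only candidate pairs whose detokenized surface strings are compatible (one is a prefix of the other), scoring them by a weighted log-probability sum. The function names, vocabularies, and probabilities are all hypothetical.

```python
import math

def agree(s1, s2):
    # Two subword strings are compatible if one is a prefix of the other,
    # so both models can still converge on the same surface output.
    return s1.startswith(s2) or s2.startswith(s1)

def surface_ensemble_step(dist_a, dist_b, w_a=0.5, w_b=0.5):
    """Pick the best surface-compatible (token_a, token_b) candidate pair.

    dist_a, dist_b: dicts mapping a model's subword token -> probability;
    the two models need NOT share a vocabulary.
    """
    best, best_score = None, -math.inf
    for tok_a, p_a in dist_a.items():
        for tok_b, p_b in dist_b.items():
            if not agree(tok_a, tok_b):
                continue  # incompatible surface forms are pruned
            score = w_a * math.log(p_a) + w_b * math.log(p_b)
            if score > best_score:
                best, best_score = (tok_a, tok_b), score
    return best

# Model A tokenizes "translation" as one piece; model B splits it.
dist_a = {"translation": 0.7, "the": 0.3}
dist_b = {"trans": 0.6, "the": 0.4}
print(surface_ensemble_step(dist_a, dist_b))
# -> ('translation', 'trans')
```

Because `"translation"` and `"trans"` form a prefix pair, decoding can continue with the second model later emitting the remaining characters, which is the sense in which the ensembled outputs stay consistent at the surface level.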

📝 Abstract
Model ensembling is a technique to combine the predicted distributions of two or more models, often leading to improved robustness and performance. For ensembling in text generation, the next token's probability distribution is derived from a weighted sum of the distributions of each individual model. This requires the underlying models to share the same subword vocabulary, limiting the applicability of ensembling, since many open-source models have distinct vocabularies. In research settings, experimentation with or upgrades to vocabularies may introduce multiple vocabulary sizes. This paper proposes an inference-time-only algorithm that allows for ensembling models with different vocabularies, without the need to learn additional parameters or alter the underlying models. Instead, the algorithm ensures that tokens generated by the ensembled models *agree* in their surface form. We apply this technique to combinations of traditional encoder-decoder models and decoder-only LLMs and evaluate on machine translation. In addition to extending to model pairs that were previously incapable of token-level ensembling, our algorithm frequently improves translation performance over either model individually.
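The standard setting the abstract describes, a weighted sum of per-model next-token distributions over one shared vocabulary, can be shown in a minimal sketch. The function name and the toy vocabularies are illustrative, not taken from the paper.

```python
def ensemble_step(distributions, weights):
    """Combine per-model next-token distributions over one SHARED vocabulary.

    distributions: list of dicts mapping token -> probability (same vocab).
    weights: one interpolation weight per model; assumed to sum to 1.
    """
    combined = {}
    for dist, w in zip(distributions, weights):
        for token, p in dist.items():
            combined[token] = combined.get(token, 0.0) + w * p
    return combined

# Two toy models over the same three-token vocabulary.
model_a = {"_the": 0.6, "_a": 0.3, "_an": 0.1}
model_b = {"_the": 0.4, "_a": 0.5, "_an": 0.1}
print(ensemble_step([model_a, model_b], [0.5, 0.5]))
# If the vocabularies differed, this sum would be ill-defined: a token in
# one model need not exist, or denote the same string, in the other.
```

This is exactly the step that breaks when vocabularies differ, which is what motivates the paper's surface-form agreement approach.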
Problem

Research questions and friction points this paper is trying to address.

Token-level ensembling traditionally requires models to share a subword vocabulary, excluding many model pairs.
Improving translation performance without altering or fine-tuning the underlying models.
Ensuring that ensembled models agree on the surface form of generated tokens.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensembling of models with different vocabularies
Inference-time algorithm with no additional parameters or model changes
Frequent translation improvements over either individual model