🤖 AI Summary
This work addresses the challenge of building efficient and scalable language models with OLMoE-1B-7B, a fully open sparse Mixture-of-Experts (MoE) model: it activates only 1 billion of its 7 billion parameters per input token, is pretrained on 5 trillion tokens, and has an instruction-tuned variant, OLMoE-1B-7B-Instruct. Methodologically, the work pioneers end-to-end openness for MoE models—releasing weights, training data, code, and full training logs—and pairs large-scale distributed training with systematic experiments on MoE design choices, including an analysis of routing that reveals high expert specialization. Empirically, OLMoE-1B-7B outperforms all available models with similar active parameter counts and even larger models such as Llama2-13B-Chat and DeepSeekMoE-16B, validating the “small activation, large capacity, full openness” paradigm.
📝 Abstract
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
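The key mechanism behind “7B parameters but only 1B per token” is sparse top-k routing: a learned router scores all experts for each token, and only the top-scoring few (OLMoE activates 8 of 64 experts per MoE layer) actually run. Below is a minimal, illustrative sketch of this idea; the function and toy single-matrix “experts” are hypothetical simplifications, not the released training code.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, k=2):
    """Toy sparse-MoE layer: route each token to its top-k experts.

    x:         (n_tokens, d) token activations
    router_w:  (d, n_experts) router projection
    expert_ws: list of (d, d) matrices, one toy "expert" each
    """
    logits = x @ router_w                                   # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)              # softmax over experts
    topk = np.argsort(probs, axis=-1)[:, -k:]               # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, topk[t]]
        gates = gates / gates.sum()                         # renormalize over chosen experts
        for g, e in zip(gates, topk[t]):
            out[t] += g * (x[t] @ expert_ws[e])             # only k experts compute per token
    return out
```

Because the untouched experts contribute nothing, per-token compute scales with k (the active parameters), not with the total expert count — which is why a 7B-parameter model can run at roughly 1B-parameter cost.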