🤖 AI Summary
This work addresses the challenge of building efficient and scalable language models with OLMoE-1B-7B, a fully open sparse Mixture-of-Experts (MoE) model: it activates only 1 billion of its 7 billion parameters per input token, is pretrained on 5 trillion tokens, and has an instruction-tuned variant, OLMoE-1B-7B-Instruct. Methodologically, the work pioneers end-to-end openness for MoE models—releasing weights, training data, code, and full training logs—and pairs large-scale distributed training with systematic experiments on MoE design choices, including an analysis of routing that reveals high expert specialization. Empirically, OLMoE-1B-7B outperforms all available models with similar active parameter counts and even larger models such as Llama2-13B-Chat and DeepSeekMoE-16B, validating the “small activation, large capacity, full openness” paradigm.
📝 Abstract
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
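The key mechanism behind “7B parameters but only 1B per token” is sparse top-k routing: a learned router scores all experts for each token, and only the top-scoring few (OLMoE activates 8 of 64 experts per MoE layer) actually run. Below is a minimal, illustrative sketch of this idea; the function and toy single-matrix “experts” are hypothetical simplifications, not the released training code.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, k=2):
    """Toy sparse-MoE layer: route each token to its top-k experts.

    x:         (n_tokens, d) token activations
    router_w:  (d, n_experts) router projection
    expert_ws: list of (d, d) matrices, one toy "expert" each
    """
    logits = x @ router_w                                   # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)              # softmax over experts
    topk = np.argsort(probs, axis=-1)[:, -k:]               # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, topk[t]]
        gates = gates / gates.sum()                         # renormalize over chosen experts
        for g, e in zip(gates, topk[t]):
            out[t] += g * (x[t] @ expert_ws[e])             # only k experts compute per token
    return out
```

Because the untouched experts contribute nothing, per-token compute scales with k (the active parameters), not with the total expert count — which is why a 7B-parameter model can run at roughly 1B-parameter cost.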