🤖 AI Summary
This work introduces the Trinity series of sparse Mixture-of-Experts (MoE) language models (Trinity Large, Trinity Mini, and Trinity Nano), designed for parameter efficiency and training stability. The architecture combines interleaved local-global attention, gated attention, depth-scaled sandwich normalization, and sigmoid-based expert routing, and the models are trained with the Muon optimizer. For Trinity Large, the authors additionally introduce Soft-clamped Momentum Expert Bias Updates (SMEBU), a new MoE load-balancing strategy. All three models completed training without loss spikes: Trinity Nano and Trinity Mini were pretrained on 10 trillion tokens, and Trinity Large on 17 trillion tokens. The model checkpoints are publicly available.
📝 Abstract
We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. We additionally report on Trinity Nano and Trinity Mini: Trinity Nano has 6B total parameters with 1B activated per token, and Trinity Mini has 26B total parameters with 3B activated per token. The models share a modern architecture that includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for the Mixture-of-Experts layers. For Trinity Large, we also introduce a new MoE load-balancing strategy, Soft-clamped Momentum Expert Bias Updates (SMEBU). We train all models with the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at https://huggingface.co/arcee-ai.
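The abstract does not spell out SMEBU, but the name suggests a variant of the auxiliary-loss-free load balancing used in sigmoid-routed MoE models, where a per-expert bias added to the router scores is nudged to rebalance expert load. The sketch below illustrates that general pattern under stated assumptions; the momentum and tanh soft-clamping details, and all names and hyperparameters (`smebu_step`, `update_rate`, `momentum`, `clamp_scale`), are hypothetical and not taken from the paper.

```python
import numpy as np

def route_topk(scores, bias, k):
    """Sigmoid routing: experts are selected by biased scores,
    but gate weights come from the unbiased sigmoid scores."""
    selected = np.argsort(scores + bias, axis=-1)[:, -k:]  # top-k experts per token
    gates = 1.0 / (1.0 + np.exp(-np.take_along_axis(scores, selected, axis=-1)))
    return selected, gates

def smebu_step(bias, velocity, load, update_rate=1e-3, momentum=0.9, clamp_scale=1.0):
    """One SMEBU-style bias update (illustrative sketch, not the paper's exact rule).

    load     : observed fraction of routing slots assigned to each expert
    velocity : running momentum buffer for the load-violation signal
    The violation (mean load minus expert load) is accumulated with momentum,
    and the resulting bias is soft-clamped with tanh instead of a hard clip.
    """
    violation = load.mean() - load                    # positive for underloaded experts
    velocity = momentum * velocity + violation
    bias = bias + update_rate * velocity
    bias = clamp_scale * np.tanh(bias / clamp_scale)  # soft clamp keeps biases bounded
    return bias, velocity

# Toy usage: 32 tokens, 8 experts, top-2 routing.
rng = np.random.default_rng(0)
scores = rng.normal(size=(32, 8))
bias = np.zeros(8)
velocity = np.zeros(8)
selected, gates = route_topk(scores, bias, k=2)
load = np.bincount(selected.ravel(), minlength=8) / selected.size
bias, velocity = smebu_step(bias, velocity, load)
```

Keeping the bias out of the gate values (it only influences which experts are selected) follows the usual design of bias-based balancing, and the soft clamp bounds how far any expert's bias can drift, which is one plausible reading of how SMEBU contributes to spike-free training.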