Arcee Trinity Large Technical Report

📅 2026-02-18
🤖 AI Summary
This work introduces the Trinity series of sparse Mixture-of-Experts (MoE) models (Trinity-Large, -Mini, and -Nano) to improve parameter efficiency and training stability in large language models. The architecture combines interleaved local-global attention, gated attention, depth-scaled sandwich normalization, and a sigmoid-based routing mechanism, and the models are trained with the Muon optimizer. A key contribution is Soft-clamped Momentum Expert Bias Updates (SMEBU), a load-balancing strategy that improves MoE training stability. All variants complete training without any loss spikes: Trinity-Nano and -Mini are pretrained on 10 trillion tokens, while Trinity-Large is pretrained on 17 trillion tokens. The model checkpoints have been publicly released.
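
The report excerpted here gives only the name and goal of SMEBU, so the sketch below is purely illustrative: it shows one plausible form of sigmoid routing with a per-expert selection bias that is updated with momentum and a tanh soft clamp on the load-imbalance signal, in the spirit of auxiliary-loss-free balancing schemes. All function names, hyperparameters, and the exact update rule are assumptions, not the authors' implementation.

```python
import torch

def sigmoid_route(hidden, router_weight, expert_bias, top_k=8):
    # Sigmoid affinity scores per expert; the load-balancing bias is added
    # only for expert *selection*, while the gate weights that mix expert
    # outputs come from the unbiased scores (a common convention in
    # auxiliary-loss-free balancing, assumed here).
    scores = torch.sigmoid(hidden @ router_weight.T)          # [tokens, n_experts]
    _, topk_idx = torch.topk(scores + expert_bias, top_k, dim=-1)
    gates = torch.gather(scores, -1, topk_idx)
    gates = gates / gates.sum(dim=-1, keepdim=True)           # renormalize kept gates
    return topk_idx, gates

def smebu_update(expert_bias, bias_momentum, topk_idx, n_experts,
                 lr=1e-3, beta=0.9, clamp_scale=1.0):
    # Hypothetical "Soft-clamped Momentum Expert Bias Update": the per-expert
    # load-imbalance signal is soft-clamped with tanh, accumulated in a
    # momentum buffer, and used to push over-loaded experts' biases down and
    # under-loaded experts' biases up. The real SMEBU rule may differ.
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    violation = load / load.mean() - 1.0                      # > 0 means over-loaded
    bias_momentum.mul_(beta).add_(torch.tanh(violation / clamp_scale))
    expert_bias.sub_(lr * bias_momentum)                      # runs outside autograd, once per step
    return expert_bias, bias_momentum
```

In a rule of this shape, the soft clamp bounds how far any single expert's bias can move per step, which is one way such an update could avoid the oscillations that hard, fixed-size bias updates can cause.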

📝 Abstract
We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. We additionally report on Trinity Nano (6B total parameters, 1B activated per token) and Trinity Mini (26B total parameters, 3B activated per token). The models' modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for the Mixture-of-Experts layers. For Trinity Large, we also introduce a new MoE load-balancing strategy, Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at https://huggingface.co/arcee-ai.
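
Muon is an existing, publicly documented optimizer rather than a contribution of this report, but since the abstract leans on it, the following is a simplified single-matrix sketch of its core update: heavy-ball momentum whose update matrix is orthogonalized with a quintic Newton-Schulz iteration before being applied. The iteration coefficients follow the public reference implementation; the learning rate, momentum, and Nesterov-style blend are illustrative defaults, not the settings used for Trinity.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately map G onto a nearby semi-orthogonal matrix using a
    # quintic Newton-Schulz iteration (coefficients from the public Muon
    # reference implementation).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)             # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # work in the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    # One Muon-style update for a single 2-D weight matrix (sketch only;
    # hyperparameters are illustrative, not the Trinity training settings).
    momentum_buf.mul_(beta).add_(grad)                                # heavy-ball momentum
    update = newton_schulz_orthogonalize(grad + beta * momentum_buf)  # Nesterov-style blend
    weight.add_(update, alpha=-lr)
    return weight, momentum_buf
```

Orthogonalizing the update equalizes the scale of its singular directions, the property usually credited for Muon's stability on large matrix parameters; per-shape scaling and the handling of embeddings and norm parameters (typically left to AdamW) are omitted here.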
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts, sparse model, large language model, parameter efficiency, scalable architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts, Soft-clamped Momentum Expert Bias Updates, interleaved attention, gated attention, Muon optimizer
Authors

Varun Singh (Arcee AI)
Lucas Krauss (Arcee AI)
Sami Jaghouar (Research Engineer, distributed training)
Matej Sirovatka (Prime Intellect)
Charles Goddard (Chief of Frontier Research, Arcee AI)
Fares Obied (Prime Intellect)
Jack Min Ong (Prime Intellect)
Jannik Straube (Prime Intellect)
Fern (Arcee AI)
Aria Harley (Arcee AI)
Conner Stewart (Arcee AI)
Colin Kealty (Arcee AI)
Maziyar Panahi (Arcee AI)
Simon Kirsten (Prime Intellect)
Anushka Deshpande (Arcee AI)
Anneketh Vij (Arcee AI)
Arthur Bresnu (Arcee AI)
Pranav Veldurthi (Arcee AI)
Raghav Ravishankar (Arcee AI)
Hardik Bishnoi (Arcee AI)
DatologyAI Team (DatologyAI)
Mark McQuade (Arcee AI)
Johannes Hagemann (Co-Founder / CTO, Prime Intellect)