INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning

📅 2025-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of scalable, permissionless, and robust training for large language models (LLMs) in globally distributed, asynchronous, and heterogeneous environments. We present PRIME-RL, a training framework purpose-built for distributed asynchronous reinforcement learning (RL), and use it to train a 32B-parameter reasoning model on a dynamic swarm of permissionless compute contributors. PRIME-RL builds on two components: TOPLOC, which verifies rollouts from untrusted inference workers, and SHARDCAST, which efficiently broadcasts policy weights from training nodes to inference workers. We further stabilize training with modifications to the standard GRPO recipe and with data filtering techniques. The resulting model improves upon QwQ-32B, the state-of-the-art reasoning model in the 32B parameter range. The model, code, and data are all open-sourced to encourage more open research in decentralized training.

📝 Abstract

We introduce INTELLECT-2, the first globally distributed reinforcement learning (RL) training run of a 32 billion parameter language model. Unlike traditional centralized training efforts, INTELLECT-2 trains a reasoning model using fully asynchronous RL across a dynamic, heterogeneous swarm of permissionless compute contributors. To enable a training run with this unique infrastructure, we built various components from scratch: we introduce PRIME-RL, our training framework purpose-built for distributed asynchronous reinforcement learning, built on top of novel components such as TOPLOC, which verifies rollouts from untrusted inference workers, and SHARDCAST, which efficiently broadcasts policy weights from training nodes to inference workers. Beyond infrastructure components, we propose modifications to the standard GRPO training recipe and data filtering techniques that were crucial for achieving training stability and ensuring that our model successfully learned its training objective, thus improving upon QwQ-32B, the state-of-the-art reasoning model in the 32B parameter range. We open-source INTELLECT-2 along with all of our code and data, hoping to encourage and enable more open research in the field of decentralized training.

Problem

Research questions and friction points this paper is trying to address.

Develop globally distributed reinforcement learning for large language models

Create infrastructure for asynchronous training on decentralized compute resources

Improve training stability and performance of reasoning models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Globally decentralized reinforcement learning training

PRIME-RL framework for distributed asynchronous RL

Modified GRPO recipe for training stability
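
For readers unfamiliar with GRPO, the baseline that the paper modifies can be illustrated with its group-relative advantage step: several completions are sampled for the same prompt, and each completion's reward is normalized against the mean and standard deviation of its own group. The following is a minimal sketch of that standard step only, not the modified recipe from the paper, whose details are not given in this summary. The function name `group_relative_advantages` and the `eps` parameter are assumptions introduced here for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages for one prompt's group of completions.

    Hypothetical helper for illustration; it does not reproduce the
    paper's modifications to the GRPO recipe.
    """
    mu = mean(rewards)        # group mean reward acts as the baseline
    sigma = pstdev(rewards)   # population std of rewards within the group
    # eps (an assumed small constant) avoids division by zero when
    # every completion in the group received the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored with a binary reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above their group's mean get a positive advantage and those below get a negative one. A group in which every completion receives the same reward yields all-zero advantages, so it contributes no gradient signal.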
