MuLoCo: Muon is a practical inner optimizer for DiLoCo

📅 2025-05-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In DiLoCo-style distributed training, communication overhead remains high because each communication step all-reduces a full copy of the model's parameters, and existing compression schemes do not exploit optimizer-specific properties. Method: MuLoCo replaces AdamW with the lightweight Muon optimizer in DiLoCo's inner loop, showing for the first time that the choice of inner optimizer substantially affects the compressibility of the communicated delta; it further integrates Top-k sparsification, 2-bit quantization, and error-feedback accumulation into the DiLoCo framework. Contribution/Results: In decoder-only language model pretraining experiments, MuLoCo significantly outperforms DiLoCo while communicating 8X less, with identical memory complexity and no degradation in final model performance.
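The Top-k sparsification mentioned above can be illustrated with a minimal NumPy sketch; this is not the paper's implementation, and the function name is a hypothetical one chosen for this example.

```python
import numpy as np

def top_k_sparsify(x, k):
    """Keep only the k largest-magnitude entries of x, zeroing the rest.
    A standard delta-compression step in communication-efficient training:
    only the k surviving values (and their indices) need to be communicated."""
    out = np.zeros_like(x)
    # argpartition gives the indices of the k largest |x| in O(n).
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

# Example: keep the 2 largest-magnitude entries of a parameter delta.
delta = np.array([0.1, -5.0, 2.0, 0.3, -1.0])
sparse_delta = top_k_sparsify(delta, k=2)  # only -5.0 and 2.0 survive
```

In practice the dropped mass is not discarded: it is carried into an error-feedback buffer and re-added before the next compression round, as the abstract below describes.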

📝 Abstract
DiLoCo is a powerful framework for training large language models (LLMs) under networking constraints, with advantages for increasing parallelism and accelerator utilization in data center settings. Despite significantly reducing communication frequency, however, DiLoCo's communication steps still involve all-reducing a complete copy of the model's parameters. While existing works have explored ways to reduce communication in DiLoCo, the role of error-feedback accumulators and the effect of the inner optimizer on compressibility remain under-explored. In this work, we investigate the effectiveness of standard compression methods, including Top-k sparsification and quantization, for reducing the communication overhead of DiLoCo when paired with two local optimizers (AdamW and Muon). Our experiments pre-training decoder-only transformer language models (LMs) reveal that using Muon as the inner optimizer for DiLoCo, together with an error-feedback accumulator, allows the communicated delta to be aggressively compressed to 2 bits with next to no performance degradation. Crucially, MuLoCo (Muon inner optimizer DiLoCo) significantly outperforms DiLoCo while communicating 8X less and having identical memory complexity.
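The 2-bit compression with an error-feedback accumulator described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's method: the quantizer is a simple uniform 4-level scheme, the all-reduce is omitted (single worker), and the outer update is plain SGD on the compressed delta; all names are illustrative.

```python
import numpy as np

def two_bit_quantize(x):
    """Uniformly quantize x to 4 levels (2 bits per element), scaled by the
    tensor's max magnitude. Returns the dequantized values for clarity;
    in real communication only the 2-bit codes and the scale would be sent."""
    scale = np.max(np.abs(x))
    if scale == 0:
        return np.zeros_like(x)
    levels = np.round((x / scale + 1) * 1.5)   # integer codes in {0, 1, 2, 3}
    return (levels / 1.5 - 1) * scale          # map codes back to [-scale, scale]

def outer_step_with_error_feedback(params, local_params, error_buf):
    """One DiLoCo-style communication step with error feedback:
    compress (delta + carried error), communicate the compressed delta,
    and store the compression residual for the next round."""
    delta = local_params - params       # pseudo-gradient from the inner loop
    corrected = delta + error_buf       # re-add what compression lost last round
    compressed = two_bit_quantize(corrected)
    error_buf = corrected - compressed  # residual carried to the next round
    # In multi-worker training, `compressed` would be all-reduced here.
    params = params + compressed        # outer update (plain SGD in this sketch)
    return params, error_buf
```

The error-feedback invariant is that the applied update plus the new residual equals the corrected delta, so nothing is permanently lost to quantization; it is only deferred.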
Problem

Research questions and friction points this paper is trying to address.

Reduce communication overhead in DiLoCo framework
Explore inner-optimizer impact on parameter compressibility
Maintain model performance with aggressive 2-bit compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Muon optimizer reduces DiLoCo communication overhead
2-bit compression with error-feedback maintains performance
MuLoCo outperforms DiLoCo with 8X less communication