Communication Efficient LLM Pre-training with SparseLoCo

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Communication bottlenecks severely hinder large language model (LLM) pretraining in bandwidth-constrained settings (e.g., cross-data-center training); existing approaches reduce communication frequency but still transmit full-precision gradients and often underperform AdamW-based DDP baselines. Method: We propose SparseLoCo—the first framework enabling efficient LLM pretraining under extreme gradient compression via Top-k sparsification and 2-bit quantization. It combines error feedback with the distributed outer optimizer, showing that the outer momentum can be well approximated by a highly sparse local accumulator, and introduces a novel sparse gradient aggregation mechanism. Contribution/Results: SparseLoCo significantly reduces communication overhead while improving both convergence speed and downstream task performance over full-precision DiLoCo, establishing a new state of the art for communication-efficient LLM pretraining.

📝 Abstract
Communication-efficient distributed training algorithms have received considerable interest recently due to their benefits for training Large Language Models (LLMs) in bandwidth-constrained settings, such as across data centers and over the internet. Despite reducing communication frequency, these methods still typically require communicating a full copy of the model's gradients, resulting in a communication bottleneck even for cross-datacenter links. Furthermore, they can slightly degrade performance compared to a naive AdamW DDP baseline. While quantization and error feedback are often applied to reduce the pseudo-gradient's size, in the context of LLM pre-training, existing approaches have been unable to additionally leverage sparsification and have obtained limited quantization. In this work, we introduce SparseLoCo, a communication-efficient training algorithm for LLMs that effectively leverages Top-k sparsification and quantization to reach extreme compression ratios of up to 1-3% sparsity and 2-bit quantization while outperforming full-precision DiLoCo. Our key observations are that outer momentum can be locally approximated by an error feedback combined with aggressive sparsity and that sparse aggregation can actually improve model performance. We empirically demonstrate in a range of communication-constrained LLM training settings that SparseLoCo provides significant benefits in both performance and communication cost.
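The core compression step described above, Top-k sparsification combined with error feedback, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function name and the 2% keep-ratio are hypothetical, and the paper applies this to pseudo-gradients inside a DiLoCo-style outer loop rather than to raw gradients as shown here.

```python
import numpy as np

def topk_sparsify_with_error_feedback(grad, error, k_fraction=0.02):
    """Keep only the largest-magnitude k-fraction of entries; accumulate
    the dropped mass into a local error buffer replayed next round."""
    corrected = grad + error                      # fold in residual from prior rounds
    k = max(1, int(k_fraction * corrected.size))
    flat = corrected.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k magnitudes
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]                       # only these values are transmitted
    new_error = flat - sparse                     # dropped entries stay local
    return sparse.reshape(grad.shape), new_error.reshape(grad.shape)

rng = np.random.default_rng(0)
g = rng.standard_normal((4, 256))
err = np.zeros_like(g)
sparse_g, err = topk_sparsify_with_error_feedback(g, err, k_fraction=0.02)
```

Because the error buffer carries every dropped entry forward, no gradient mass is lost over time; this local accumulator is what, per the abstract, approximates the role of the outer momentum.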
Problem

Research questions and friction points this paper is trying to address.

The communication bottleneck in LLM pre-training across bandwidth-constrained links
How to leverage sparsification and quantization for extreme gradient compression
Maintaining model performance while drastically reducing communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

Top-k sparsification for extreme compression (1-3% density)
2-bit quantization of transmitted values
Local approximation of outer momentum via error feedback
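On top of sparsification, the surviving Top-k values are quantized to 2 bits. A minimal sketch of one way such a quantizer could work, assuming uniform quantization over the value range (the paper's actual quantizer is not detailed in this summary, and the function names are hypothetical):

```python
import numpy as np

def quantize_2bit(values):
    """Uniformly quantize a vector to 4 levels (2 bits per entry);
    return integer codes plus the offset/scale needed to decode."""
    lo, hi = values.min(), values.max()
    scale = (hi - lo) / 3 if hi > lo else 1.0     # 4 levels span 3 intervals
    codes = np.clip(np.round((values - lo) / scale), 0, 3).astype(np.uint8)
    return codes, lo, scale

def dequantize_2bit(codes, lo, scale):
    """Map 2-bit codes back to approximate float values."""
    return lo + codes.astype(np.float64) * scale

vals = np.array([-1.2, -0.1, 0.4, 0.9, 1.5])
codes, lo, scale = quantize_2bit(vals)
recovered = dequantize_2bit(codes, lo, scale)
```

The per-entry reconstruction error is bounded by half a quantization step; in the full algorithm, any such error would also be absorbed by the error-feedback buffer rather than lost.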