SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional offloading strategies for large language model (LLM) training are ill-suited for the GH200 "superchip" architecture, which features tightly coupled Grace CPUs and Hopper GPUs interconnected via the high-bandwidth NVLink-C2C link. Method: We propose the first superchip-aware offloading system, integrating ZeRO-based data parallelism and DeepSpeed-Ulysses sequence parallelism. Key innovations include adaptive weight offloading, bucketed data repartitioning, superchip-aware mixed-precision type conversion, lightweight speculative execution, and a Grace-CPU-optimized Adam implementation. Contribution/Results: Our system achieves up to 2.5× higher throughput on a single GH200 compared to baseline approaches, enabling training of 25B-parameter models. On an 8-GH200 configuration, it successfully trains a 13B model with million-token context windows, attaining a model FLOPs utilization (MFU) of 55%.

📝 Abstract
The emergence of Superchips represents a significant advancement in next-generation AI hardware. These Superchips employ a tightly coupled heterogeneous architecture that integrates GPU and CPU on the same package, which offers unprecedented computational power. However, there has been scant research investigating how LLM training benefits from this new architecture. In this work, for the first time, we study LLM training solutions based on offloading for Superchips. We observe important differences between Superchips and traditional loosely coupled GPU-CPU architectures, which necessitate revisiting prevailing assumptions about offloading. Building on these observations, we present SuperOffload, a Superchip-centric offloading system that simultaneously uses the Hopper GPU, Grace CPU, and NVLink-C2C interconnect more efficiently. SuperOffload accomplishes this via a combination of techniques, such as adaptive weight offloading, bucketization repartitioning, Superchip-aware casting, speculative execution, and a highly optimized Adam optimizer for Grace CPUs. Our evaluation of SuperOffload on NVIDIA GH200 demonstrates up to 2.5x throughput improvement compared to state-of-the-art offloading-based systems, enabling training of models with up to 25B parameters on a single Superchip at high training throughput. We also extend SuperOffload with ZeRO-style data parallelism and DeepSpeed-Ulysses sequence parallelism, enabling training of a 13B model with sequence lengths of up to 1 million tokens on 8 GH200 superchips while achieving 55% MFU.
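The bucketization repartitioning idea mentioned in the abstract can be illustrated with a minimal sketch. All function names, the bucket layout, and the copy/compute schedule below are assumptions for illustration, not the paper's implementation: the point is only that splitting a flat buffer into buckets lets the transfer of bucket i+1 overlap with compute on bucket i.

```python
# Illustrative sketch (names and sizes are assumptions, not the paper's
# implementation): split a flat parameter buffer into fixed-size buckets
# so each bucket's host-device copy can overlap with compute on the
# previously staged bucket, as in a double-buffered offloading pipeline.

def make_buckets(num_elements, bucket_size):
    """Return (start, end) index ranges covering the flat buffer."""
    return [(i, min(i + bucket_size, num_elements))
            for i in range(0, num_elements, bucket_size)]

def pipeline_schedule(num_buckets):
    """Emit a copy/compute event order that prefetches the next bucket."""
    events = [("copy", 0)]  # stage the first bucket before any compute
    for i in range(num_buckets):
        if i + 1 < num_buckets:
            events.append(("copy", i + 1))  # prefetch the next bucket
        events.append(("compute", i))       # work on the current bucket
    return events
```

In such a schedule every compute step except the first has its input copy issued one step earlier, which is the basic mechanism that hides transfer latency behind computation.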
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM training efficiency on Superchip heterogeneous architectures
Revisiting offloading assumptions that no longer hold on tightly coupled Superchips versus traditional GPU-CPU systems
Enabling large-scale model training with limited hardware resources via offloading
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive weight offloading and bucketization repartitioning
Superchip-aware casting and speculative execution techniques
Highly optimized Adam optimizer for Grace CPUs