Optimizing Large Model Training through Overlapped Activation Recomputation

📅 2024-06-13
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
In large-model training, activation recomputation and communication pipelining are difficult to overlap, leading to high critical-path latency and severe memory-compute-communication imbalance. To address this, we propose Lynx—a framework that introduces (1) a fine-grained execution mechanism enabling precise overlap of recomputation and communication; (2) a heuristic scheduling algorithm leveraging model structural similarity to jointly optimize recomputation timing and communication phases; and (3) a recomputation-aware model partitioning strategy that balances GPU memory constraints with load distribution across pipeline stages. Experiments on GPT models ranging from 1.3B to 23B parameters demonstrate that Lynx achieves up to 1.37× higher training throughput compared to state-of-the-art recomputation schemes, significantly alleviates GPU memory bottlenecks, and improves overall system resource utilization.

📝 Abstract
Large model training often uses recomputation to alleviate memory pressure and pipelines to exploit the parallelism of data, tensors, and devices. However, existing recomputation approaches may incur high overhead when training real-world models, as they are executed on demand in the critical training path. In this paper, we present Lynx, a new recomputation framework to reduce overhead by overlapping recomputation with communication in training pipelines. To reduce the large search space for recomputation strategies, we propose a heuristic-based recomputation scheduling algorithm, which is based on the observation that there are identical structures in large DNN models so that we can apply the same scheduling policy to all such structures. Additionally, we propose a recomputation-aware model partitioning method to balance each stage's execution time for improved training throughput. Our comprehensive evaluation using GPT models with 1.3B-23B parameters shows that Lynx outperforms existing recomputation approaches by up to 1.37x.
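The abstract's core idea — taking recomputation off the critical path by hiding it behind pipeline communication — can be illustrated with a toy latency model. This sketch is not from the paper; the functions and timing numbers are hypothetical and only show why overlap reduces per-step time.

```python
# Toy latency model for one pipeline training step (illustrative only).

def step_time_on_demand(backward: float, recompute: float, comm: float) -> float:
    # On-demand recomputation: recompute, backward, and communication
    # all run serially on the critical path.
    return recompute + backward + comm

def step_time_overlapped(backward: float, recompute: float, comm: float) -> float:
    # Overlapped scheduling: recomputation runs concurrently with
    # communication, so only the longer of the two remains on the path.
    return backward + max(recompute, comm)

if __name__ == "__main__":
    backward, recompute, comm = 10.0, 4.0, 6.0  # hypothetical milliseconds
    naive = step_time_on_demand(backward, recompute, comm)        # 20.0
    overlapped = step_time_overlapped(backward, recompute, comm)  # 16.0
    print(f"speedup: {naive / overlapped:.2f}x")                  # 1.25x
```

In this toy setting the recomputation cost is fully hidden whenever it fits under the communication window, which is the regime where overlap-based schedulers pay off most.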
Problem

Research questions and friction points this paper is trying to address.

On-demand recomputation runs on the critical training path, adding high overhead
The search space of recomputation scheduling strategies is too large to explore exhaustively
Imbalanced execution times across pipeline stages limit training throughput
Innovation

Methods, ideas, or system contributions that make the work stand out.

Overlaps recomputation with communication in pipelines
Uses heuristic-based scheduling for identical DNN structures
Applies recomputation-aware model partitioning for balance
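The recomputation-aware partitioning idea — balancing per-stage load once recomputation cost is counted — can be sketched as a small contiguous-partition search. This is a hypothetical illustration, not the paper's algorithm; the layer costs and the exhaustive search are assumptions chosen to keep the example tiny.

```python
from itertools import combinations

def balanced_partition(costs, stages):
    """Exhaustively pick stage boundaries for a small layer list,
    minimizing the maximum per-stage load. Each entry of `costs`
    is a layer's compute time plus any recomputation overhead."""
    n = len(costs)
    best_load, best_bounds = float("inf"), None
    for cuts in combinations(range(1, n), stages - 1):
        bounds = (0, *cuts, n)  # stage i covers layers bounds[i]:bounds[i+1]
        load = max(sum(costs[a:b]) for a, b in zip(bounds, bounds[1:]))
        if load < best_load:
            best_load, best_bounds = load, bounds
    return best_load, best_bounds

if __name__ == "__main__":
    compute = [3, 3, 3, 3, 3, 3, 3, 3]    # hypothetical per-layer time
    recompute = [2, 2, 0, 0, 0, 0, 2, 2]  # extra cost where activations are recomputed
    costs = [c + r for c, r in zip(compute, recompute)]
    load, bounds = balanced_partition(costs, stages=4)
    print(load, bounds)
```

The point of the sketch: a partitioner that ignores recomputation would balance the raw `compute` list evenly, then end up imbalanced once recomputation overhead lands on the edge stages; accounting for it up front shifts the boundaries.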
👥 Authors

Ping Chen
Zhejiang University, Huawei Cloud†
Wenjie Zhang
Zhejiang University, Huawei Cloud†
Shuibing He
Professor of Zhejiang University
Intelligent Computing · Storage Systems · Processing-in-Memory · Computer Architecture
Yingjie Gu
Huawei Cloud†
Zhuwei Peng
Huawei Cloud†
Kexin Huang
Zhejiang University, Huawei Cloud†
Xuan Zhan
Zhejiang University, Huawei Cloud†
Weijian Chen
Zhejiang University, Huawei Cloud†
Yi Zheng
Huawei Cloud†
Zhefeng Wang
Huawei Cloud
NLP · AI Systems · LLM · Multi-modality · Machine Learning
Yanlong Yin
Zhejiang University, Huawei Cloud†
Gang Chen
Zhejiang University, Huawei Cloud†