🤖 AI Summary
This work addresses the inefficiency of deepening large language models during training. We propose a zero/one-layer progressive depth-expansion method that dynamically inserts new layers during training, bypassing full retraining from scratch. Theoretically, we analyze four key mechanisms grounded in optimization dynamics and feature learning: the initialization of new layers, hyperparameter transfer, learning-rate rescaling, and the optimal timing of expansion. Practically, we design an end-to-end differentiable depth-growth strategy compatible with mainstream architectures such as GPT-2. Empirical evaluation on GPT-2 demonstrates an approximately 5× training speedup (equivalently, about 80% computational savings) with negligible validation-loss degradation; the resulting model matches the performance of a fully trained 60-layer, 7B-parameter baseline. Our approach significantly improves both the efficiency and scalability of depth expansion in large-model training.
📝 Abstract
Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but incur higher computational cost. To efficiently train models at scale, an effective strategy is progressive training, which scales up model capacity during training and hence significantly reduces computation with little to no performance degradation. In this work, we study the depth expansion of large models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, the learning-rate schedule, and the timing of model expansion. Specifically, we propose zero/one-layer progressive training for an optimal tradeoff between computation and loss. For example, zero/one-layer progressive training on GPT2 can save $\approx 80\%$ compute, or equivalently accelerate training by $\approx 5\times$, while achieving almost the same loss as a fully trained 60-layer model with 7B parameters.
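To make the core idea concrete, here is a minimal NumPy sketch of function-preserving depth expansion: inserting a residual block whose output projection is zero-initialized, so the grown network computes exactly the same function at the moment of insertion. The toy residual-MLP model, block structure, and all names here are illustrative assumptions, not the paper's actual architecture or initialization scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_block(dim, zero_init=False):
    """Residual block: x + w2 @ relu(w1 @ x).
    Zero-initializing w2 makes the block an exact identity map
    at insertion time (an assumed, simplified initialization)."""
    w1 = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    w2 = np.zeros((dim, dim)) if zero_init else rng.standard_normal((dim, dim)) / np.sqrt(dim)
    return (w1, w2)

def forward(blocks, x):
    for w1, w2 in blocks:
        x = x + w2 @ np.maximum(w1 @ x, 0.0)
    return x

dim = 8
blocks = [make_block(dim) for _ in range(4)]  # shallow model mid-training
x = rng.standard_normal(dim)
y_before = forward(blocks, x)

# Grow depth: insert a zero-initialized block; the network's output
# is unchanged, so training can resume without a loss spike.
blocks.insert(2, make_block(dim, zero_init=True))
y_after = forward(blocks, x)

print(np.allclose(y_before, y_after))  # True: expansion is function-preserving
```

In a real training run, the inserted block's parameters then receive gradients and gradually contribute, which is why questions of learning-rate rescaling and expansion timing (the paper's focus) become important.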