🤖 AI Summary
This work addresses the inefficiency of deepening large language models during training. We propose a zero/one-layer progressive depth-expansion method that dynamically inserts new layers during training, bypassing full retraining from scratch. Theoretically, we analyze four key mechanisms grounded in optimization dynamics and feature learning: the initialization of new layers, hyperparameter transfer, learning-rate rescaling, and the optimal timing of expansion. Practically, we design an end-to-end differentiable depth-growth strategy compatible with mainstream architectures such as GPT-2. Empirical evaluation on GPT-2 demonstrates an approximately 5× training speedup (equivalently, about 80% computational savings) with negligible validation-loss degradation; the resulting model matches the performance of a fully trained 60-layer, 7B-parameter baseline. Our approach significantly improves both the efficiency and scalability of depth expansion in large-model training.
📝 Abstract
Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but incur higher computational cost. To efficiently train models at scale, an effective strategy is progressive training, which scales up model capacity during training and hence significantly reduces computation with little to no performance degradation. In this work, we study the depth expansion of large models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, the learning-rate schedule, and the timing of model expansion. Specifically, we propose zero/one-layer progressive training for an optimal tradeoff between computation and loss. For example, zero/one-layer progressive training on GPT2 can save $\approx 80\%$ compute, or equivalently accelerate training by $\approx 5\times$, while achieving almost the same loss as a fully trained 60-layer model with 7B parameters.
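To make the core idea concrete, here is a minimal NumPy sketch of function-preserving depth expansion: inserting a residual block whose output projection is zero-initialized, so the grown network computes exactly the same function at the moment of insertion. The toy residual-MLP model, block structure, and all names here are illustrative assumptions, not the paper's actual architecture or initialization scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_block(dim, zero_init=False):
    """Residual block: x + w2 @ relu(w1 @ x).
    Zero-initializing w2 makes the block an exact identity map
    at insertion time (an assumed, simplified initialization)."""
    w1 = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    w2 = np.zeros((dim, dim)) if zero_init else rng.standard_normal((dim, dim)) / np.sqrt(dim)
    return (w1, w2)

def forward(blocks, x):
    for w1, w2 in blocks:
        x = x + w2 @ np.maximum(w1 @ x, 0.0)
    return x

dim = 8
blocks = [make_block(dim) for _ in range(4)]  # shallow model mid-training
x = rng.standard_normal(dim)
y_before = forward(blocks, x)

# Grow depth: insert a zero-initialized block; the network's output
# is unchanged, so training can resume without a loss spike.
blocks.insert(2, make_block(dim, zero_init=True))
y_after = forward(blocks, x)

print(np.allclose(y_before, y_after))  # True: expansion is function-preserving
```

In a real training run, the inserted block's parameters then receive gradients and gradually contribute, which is why questions of learning-rate rescaling and expansion timing (the paper's focus) become important.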