🤖 AI Summary
To address the low training efficiency of Transformers, this work proposes, for the first time, a multi-level adaptive discretization training framework grounded in their continuous-time ordinary differential equation (ODE) formulation. Without altering the model architecture or loss function, the method dynamically adjusts the granularity of ODE numerical integration across optimization stages by jointly optimizing gradient scaling and step-size adaptation, thereby enabling fine-grained allocation of computational resources. Its core innovation lies in systematically incorporating multi-level numerical integration techniques into Transformer training—overcoming the limitations of conventional fixed-step or single-scale discretization schemes. Experiments demonstrate that the approach maintains full model accuracy while significantly accelerating convergence; end-to-end training time is reduced by 30–40%. This establishes a novel paradigm for efficient large-model training.
📝 Abstract
In this article, we investigate the potential of multilevel approaches to accelerate the training of transformer architectures. Using an ordinary differential equation (ODE) interpretation of these architectures, we propose an appropriate way of varying the discretization of these ODE Transformers in order to accelerate the training. We validate our approach experimentally by a comparison with the standard training procedure.