🤖 AI Summary
This work addresses stochastic bilevel optimization problems where the lower-level objective is strongly convex and the upper-level objective is nonconvex with a gradient that lacks Lipschitz continuity, a setting common in Transformer-based models. We propose AdamBO, the first single-loop Adam-type algorithm for bilevel optimization. Methodologically, AdamBO integrates momentum-adapted step sizes, implicit gradient estimation, and coordinated updates of the upper- and lower-level variables, and it relies on a novel randomness decoupling lemma to overcome theoretical barriers under unbounded smoothness. We establish a convergence rate with $\widetilde{O}(\epsilon^{-4})$ stochastic first- and second-order oracle complexity. Empirically, AdamBO significantly outperforms existing baselines on RNN- and Transformer-based meta-learning tasks. Our approach provides a practical, scalable, and theoretically grounded tool for bilevel optimization under unbounded smoothness.
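To make the single-loop structure concrete, here is a minimal sketch of one coordinated update step, assuming stochastic oracles `grad_g_y` (lower-level gradient) and `hypergrad` (an implicit-gradient estimate built from Hessian/Jacobian-vector products). The function names, step sizes, and toy problem are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def adam_bilevel_step(x, y, m, v, t, grad_g_y, hypergrad,
                      alpha=1e-3, gamma=1e-2,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One coordinated update: Adam on the upper-level x, SGD on the lower-level y."""
    # Lower level: one stochastic gradient step toward argmin_y g(x, y).
    y = y - gamma * grad_g_y(x, y)

    # Upper level: Adam update driven by a stochastic hypergradient estimate.
    h = hypergrad(x, y)                  # implicit-gradient estimate
    m = beta1 * m + (1 - beta1) * h      # first moment (momentum)
    v = beta2 * v + (1 - beta2) * h**2   # second moment (adaptive step size)
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, y, m, v

# Toy usage on a quadratic instance (illustrative stand-ins for the oracles).
rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
m, v = np.zeros(5), np.zeros(5)
for t in range(1, 1001):
    x, y, m, v = adam_bilevel_step(
        x, y, m, v, t,
        grad_g_y=lambda x, y: y - x,   # g(x, y) = ||y - x||^2 / 2, strongly convex in y
        hypergrad=lambda x, y: x + y,  # hypothetical hypergradient estimate
    )
```

Because both variables advance in the same loop iteration, the method avoids the nested inner solves of double-loop bilevel algorithms; the randomness decoupling lemma is what makes this coordination analyzable under unbounded smoothness.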
📝 Abstract
Adam has become one of the most popular optimizers for training modern deep neural networks, such as transformers. However, its applicability is largely restricted to single-level optimization problems. In this paper, we aim to extend vanilla Adam to tackle bilevel optimization problems, which have important applications in machine learning, such as meta-learning. In particular, we study stochastic bilevel optimization problems where the lower-level function is strongly convex and the upper-level objective is nonconvex with potentially unbounded smoothness. This unbounded smooth objective function covers a broad class of neural networks, including transformers, which may exhibit non-Lipschitz gradients. In this work, we introduce AdamBO, a single-loop Adam-type method that achieves $\widetilde{O}(\epsilon^{-4})$ oracle complexity to find $\epsilon$-stationary points, where the oracle calls involve stochastic gradient or Hessian/Jacobian-vector product evaluations. The key to our analysis is a novel randomness decoupling lemma that provides refined control over the lower-level variable. We conduct extensive experiments on various machine learning tasks involving bilevel formulations with recurrent neural networks (RNNs) and transformers, demonstrating the effectiveness of our proposed Adam-type algorithm.
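For reference, the bilevel problem described above has the standard form shown below; the relaxed-smoothness inequality is one common way this literature formalizes "unbounded smoothness" (the paper's exact assumptions may differ in detail, so treat this as an assumed reading).

```latex
% Standard stochastic bilevel formulation:
\min_{x}\; \Phi(x) := f\bigl(x,\, y^*(x)\bigr)
\quad \text{s.t.} \quad
y^*(x) = \arg\min_{y}\; g(x, y),
\qquad \text{with } g(x,\cdot) \text{ strongly convex.}

% Relaxed (L_0, L_1)-smoothness of the upper level (a common formalization,
% assumed here): the Hessian norm may grow with the gradient norm,
\bigl\|\nabla^2 f(x, y)\bigr\| \;\le\; L_0 + L_1 \bigl\|\nabla f(x, y)\bigr\|,
% so the gradient need not be Lipschitz, matching transformer loss landscapes.
```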