🤖 AI Summary
This work addresses stochastic bilevel optimization problems where the lower-level objective is strongly convex and the upper-level objective is nonconvex with a gradient that lacks Lipschitz continuity, a setting common in Transformer-based models. We propose AdamBO, the first single-loop Adam-type algorithm for bilevel optimization. Methodologically, AdamBO integrates momentum-adapted step sizes, implicit gradient estimation, and coordinated updates of the upper- and lower-level variables, and it relies on a novel randomness decoupling lemma to overcome theoretical barriers under unbounded smoothness. We establish a convergence rate with $\widetilde{O}(\epsilon^{-4})$ stochastic first- and second-order oracle complexity. Empirically, AdamBO significantly outperforms existing baselines on RNN- and Transformer-based meta-learning tasks. Our approach provides a practical, scalable, and theoretically grounded tool for bilevel optimization under unbounded smoothness.
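To make the single-loop structure concrete, here is a minimal sketch of one coordinated update step, assuming stochastic oracles `grad_g_y` (lower-level gradient) and `hypergrad` (an implicit-gradient estimate built from Hessian/Jacobian-vector products). The function names, step sizes, and toy problem are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def adam_bilevel_step(x, y, m, v, t, grad_g_y, hypergrad,
                      alpha=1e-3, gamma=1e-2,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One coordinated update: Adam on the upper-level x, SGD on the lower-level y."""
    # Lower level: one stochastic gradient step toward argmin_y g(x, y).
    y = y - gamma * grad_g_y(x, y)

    # Upper level: Adam update driven by a stochastic hypergradient estimate.
    h = hypergrad(x, y)                  # implicit-gradient estimate
    m = beta1 * m + (1 - beta1) * h      # first moment (momentum)
    v = beta2 * v + (1 - beta2) * h**2   # second moment (adaptive step size)
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, y, m, v

# Toy usage on a quadratic instance (illustrative stand-ins for the oracles).
rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
m, v = np.zeros(5), np.zeros(5)
for t in range(1, 1001):
    x, y, m, v = adam_bilevel_step(
        x, y, m, v, t,
        grad_g_y=lambda x, y: y - x,   # g(x, y) = ||y - x||^2 / 2, strongly convex in y
        hypergrad=lambda x, y: x + y,  # hypothetical hypergradient estimate
    )
```

Because both variables advance in the same loop iteration, the method avoids the nested inner solves of double-loop bilevel algorithms; the randomness decoupling lemma is what makes this coordination analyzable under unbounded smoothness.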
📝 Abstract
Adam has become one of the most popular optimizers for training modern deep neural networks, such as transformers. However, its applicability is largely restricted to single-level optimization problems. In this paper, we aim to extend vanilla Adam to tackle bilevel optimization problems, which have important applications in machine learning, such as meta-learning. In particular, we study stochastic bilevel optimization problems where the lower-level function is strongly convex and the upper-level objective is nonconvex with potentially unbounded smoothness. This unbounded smooth objective function covers a broad class of neural networks, including transformers, which may exhibit non-Lipschitz gradients. In this work, we introduce AdamBO, a single-loop Adam-type method that achieves $\widetilde{O}(\epsilon^{-4})$ oracle complexity to find $\epsilon$-stationary points, where the oracle calls involve stochastic gradient or Hessian/Jacobian-vector product evaluations. The key to our analysis is a novel randomness decoupling lemma that provides refined control over the lower-level variable. We conduct extensive experiments on various machine learning tasks involving bilevel formulations with recurrent neural networks (RNNs) and transformers, demonstrating the effectiveness of our proposed Adam-type algorithm.
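For reference, the bilevel problem described above has the standard form shown below; the relaxed-smoothness inequality is one common way this literature formalizes "unbounded smoothness" (the paper's exact assumptions may differ in detail, so treat this as an assumed reading).

```latex
% Standard stochastic bilevel formulation:
\min_{x}\; \Phi(x) := f\bigl(x,\, y^*(x)\bigr)
\quad \text{s.t.} \quad
y^*(x) = \arg\min_{y}\; g(x, y),
\qquad \text{with } g(x,\cdot) \text{ strongly convex.}

% Relaxed (L_0, L_1)-smoothness of the upper level (a common formalization,
% assumed here): the Hessian norm may grow with the gradient norm,
\bigl\|\nabla^2 f(x, y)\bigr\| \;\le\; L_0 + L_1 \bigl\|\nabla f(x, y)\bigr\|,
% so the gradient need not be Lipschitz, matching transformer loss landscapes.
```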