🤖 AI Summary
To address the high computational cost, excessive memory consumption, and unstable convergence of Frank–Wolfe (FW) algorithms in training deep neural networks—particularly under non-convex settings—this paper proposes Projected-FG, a low-computation and low-memory projected forward-gradient estimation method, and integrates it into the FW framework for the first time. We further design a historical-direction aggregation mechanism to suppress gradient variance, and provide rigorous theoretical guarantees: Projected-FG converges to the global optimum in convex settings and to a first-order stationary point in non-convex settings. Empirical evaluations demonstrate that Projected-FG reduces GPU memory usage by up to 58% while maintaining optimization accuracy and stability comparable to standard FW and SGD. This work establishes a new, provably convergent paradigm for large-scale non-convex optimization.
📝 Abstract
This paper aims to enhance the use of the Frank-Wolfe (FW) algorithm for training deep neural networks. Similar to any gradient-based optimization algorithm, FW suffers from high computational and memory costs when computing gradients for DNNs. This paper introduces the application of the recently proposed projected forward gradient (Projected-FG) method to the FW framework, offering reduced computational cost similar to backpropagation and low memory utilization akin to forward propagation. Our results show that trivial application of the Projected-FG introduces non-vanishing convergence error due to the stochastic noise that the Projected-FG method introduces in the process. This noise results in an non-vanishing variance in the Projected-FG estimated gradient. To address this, we propose a variance reduction approach by aggregating historical Projected-FG directions. We demonstrate rigorously that this approach ensures convergence to the optimal solution for convex functions and to a stationary point for non-convex functions. These convergence properties are validated through a numerical example, showcasing the approach's effectiveness and efficiency.