🤖 AI Summary
This paper addresses AI optimization under heavy-tailed noise arising from severely corrupted data, where stochastic gradients have bounded $\kappa$-th moments for $\kappa \in (1, 2]$, a setting in which standard gradient-based methods fail. Method: We propose the first hyperparameter-free, gradient-clipping-free sign-based optimization framework, relying only on gradient signs and majority-voting mechanisms. Contributions: (1) We establish the first high-probability optimal sample complexity bound of $\tilde{O}(\varepsilon^{-(3\kappa-2)/(\kappa-1)})$ for SignSGD; (2) under symmetric noise, we extend the framework to distributed settings via SignSGD with Majority Voting and reduce the sample complexity to $\tilde{O}(\varepsilon^{-4})$ in the single-worker case; (3) we derive the first high-probability comparison complexity bound of $\tilde{O}(\varepsilon^{-6})$ for comparison-based zeroth-order optimization via a novel method, MajorityVote-CompsSGD. The framework is provably robust, structurally simple, and theoretically optimal. Empirically, it significantly outperforms existing baselines in large language model training.
📄 Abstract
The growing popularity of AI optimization problems involving severely corrupted data has increased the demand for methods capable of handling heavy-tailed noise, i.e., noise with bounded $\kappa$-th moment, $\kappa \in (1,2]$. For the widely used clipping technique, effectiveness heavily depends on careful tuning of clipping levels throughout training. In this paper, we demonstrate that using only the sign of the input, without introducing additional hyperparameters, is sufficient to cope with heavy-tailed noise effectively. For smooth non-convex functions, we prove that SignSGD achieves the optimal sample complexity $\tilde{O}\left(\varepsilon^{-\frac{3\kappa - 2}{\kappa - 1}}\right)$ with high probability for attaining an average gradient norm accuracy of $\varepsilon$. Under the assumption of symmetric noise, we use SignSGD with Majority Voting to extend this bound to distributed optimization, or to reduce the sample complexity to $\tilde{O}(\varepsilon^{-4})$ in the case of a single worker with arbitrary parameters. Furthermore, we explore the application of the sign operator in zeroth-order optimization with an oracle that can only compare function values at two different points. We propose a novel method, MajorityVote-CompsSGD, and provide the first known high-probability bound $\tilde{O}(\varepsilon^{-6})$ on the number of comparisons under the symmetric noise assumption. Our theoretical findings are supported by the superior performance of sign-based methods in training Large Language Models.
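To make the two core mechanisms concrete, here is a minimal illustrative sketch (not the paper's exact algorithm or constants): a SignSGD step, which updates each coordinate using only the sign of the stochastic gradient, and a coordinate-wise majority vote over per-worker gradient signs. The toy objective, learning rate, worker count, and Student-t noise model are all assumptions chosen for illustration; heavy-tailed Student-t noise with 1.5 degrees of freedom has a bounded $\kappa$-th moment only for $\kappa < 1.5$.

```python
import numpy as np

def sign_sgd_step(x, grad, lr):
    """One SignSGD update: step against the sign of the (stochastic) gradient."""
    return x - lr * np.sign(grad)

def majority_vote_sign(worker_grads):
    """Coordinate-wise majority vote over per-worker gradient signs."""
    signs = np.sign(worker_grads)      # shape: (num_workers, dim)
    return np.sign(signs.sum(axis=0))  # winning sign per coordinate

# Toy run on f(x) = ||x||^2 with heavy-tailed noise across 8 "workers".
rng = np.random.default_rng(0)
x = np.array([5.0, -3.0])
for _ in range(200):
    true_grad = 2.0 * x
    # Heavy-tailed perturbation: Student-t with df=1.5 per worker and coordinate.
    worker_grads = true_grad + rng.standard_t(df=1.5, size=(8, 2))
    x = sign_sgd_step(x, majority_vote_sign(worker_grads), lr=0.05)
```

Note that the update never scales with the gradient magnitude, so a single heavy-tailed outlier can shift each coordinate by at most the learning rate, which is the intuition behind the robustness of sign-based methods without clipping.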