Multi-Timescale Gradient Sliding for Distributed Optimization

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high communication-round complexity and rigid sub-agent communication rates in convex nonsmooth distributed optimization, this paper proposes two deterministic first-order algorithms: MT-GS and AMT-GS. Methodologically, it introduces, for the first time, a multi-timescale dual block update mechanism into nonsmooth distributed optimization, building a block-decomposable primal-dual framework coupled with multi-timescale sliding. The algorithms support asynchronous, user-configurable subset communication and explicitly exploit the local function similarity parameter $A$. Theoretically, MT-GS achieves a communication complexity of $O(\overline{r}A/\epsilon)$, while AMT-GS attains $O(\overline{r}A/\sqrt{\epsilon\mu})$, both exhibiting optimal linear dependence on $A$. This resolves, affirmatively and for the first time, the open question posed by Arjevani and Shamir of whether linear dependence on $A$ is attainable in the nonsmooth setting.

📝 Abstract
We propose two first-order methods for convex, non-smooth, distributed optimization problems, hereafter called Multi-Timescale Gradient Sliding (MT-GS) and its accelerated variant (AMT-GS). Our MT-GS and AMT-GS can take advantage of similarities between (local) objectives to reduce the communication rounds, are flexible so that different subsets (of agents) can communicate at different, user-picked rates, and are fully deterministic. These three desirable features are achieved through a block-decomposable primal-dual formulation and a multi-timescale variant of the sliding method introduced in Lan et al. (2020), Lan (2016), where different dual blocks are updated at potentially different rates. To find an $\epsilon$-suboptimal solution, the complexities of our algorithms achieve optimal dependency on $\epsilon$: MT-GS needs $O(\overline{r}A/\epsilon)$ communication rounds and $O(\overline{r}/\epsilon^2)$ subgradient steps for Lipschitz objectives, and AMT-GS needs $O(\overline{r}A/\sqrt{\epsilon\mu})$ communication rounds and $O(\overline{r}/(\epsilon\mu))$ subgradient steps if the objectives are also $\mu$-strongly convex. Here, $\overline{r}$ measures the "average rate of updates" for dual blocks, and $A$ measures similarities between (subgradients of) local functions. In addition, the linear dependency of communication rounds on $A$ is optimal (Arjevani and Shamir 2015), thereby providing a positive answer to the open question of whether such dependency is achievable for non-smooth objectives (Arjevani and Shamir 2015).
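The core ideas in the abstract — a primal-dual formulation with one dual block per communication link, several cheap local subgradient ("sliding") steps between communication rounds, and dual blocks refreshed at different user-picked rates — can be illustrated on a toy consensus problem. The sketch below is not the paper's MT-GS algorithm; the objectives, graph, step sizes, and update rates are all invented for illustration.

```python
import numpy as np

# Illustrative toy sketch (NOT the paper's MT-GS): a primal-dual subgradient
# method for the consensus problem min_x sum_i |x - a_i|, where each dual
# block (one per network edge) is updated at its own user-chosen rate, and
# several local subgradient ("sliding") steps run between communication rounds.

anchors = np.array([-1.0, 0.5, 2.0])   # data defining local objectives f_i(x) = |x - a_i|
edges = [(0, 1), (1, 2)]               # communication graph: a path on 3 agents
rates = {0: 1, 1: 2}                   # dual block e is updated every rates[e] rounds

n = len(anchors)
x = np.zeros(n)                        # primal iterate, one coordinate per agent
lam = np.zeros(len(edges))             # one dual variable per edge block
eta, tau = 0.05, 0.1                   # primal / dual step sizes (hand-picked)
T, S = 400, 5                          # communication rounds, sliding steps per round
x_avg = np.zeros(n)                    # ergodic (averaged) output

for t in range(T):
    # Sliding phase: S local subgradient steps with the dual variables frozen;
    # no communication happens inside this inner loop.
    for _ in range(S):
        g = np.sign(x - anchors)       # a subgradient of |x_i - a_i|
        for e, (i, j) in enumerate(edges):
            g[i] += lam[e]             # gradient of the coupling term lam_e * (x_i - x_j)
            g[j] -= lam[e]
        x = x - eta * g
    # Multi-timescale dual phase: block e is updated only on its own schedule.
    for e, (i, j) in enumerate(edges):
        if t % rates[e] == 0:
            lam[e] += tau * (x[i] - x[j])   # ascent on the consensus constraint x_i = x_j
    x_avg += x

x_avg /= T
print(x_avg)   # all agents end up roughly near the median of the anchors
```

Edge 0 communicates every round while edge 1 communicates every other round, mimicking the "different subsets at different, user-picked rates" feature; the averaged iterate is reported because subgradient-type saddle-point methods are typically analyzed through their ergodic averages.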
Problem

Research questions and friction points this paper is trying to address.

Distributed optimization for non-smooth convex problems
Reducing communication rounds via multi-timescale gradient sliding
Achieving optimal dependency on solution accuracy and similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-timescale gradient sliding for optimization
Block-decomposable primal-dual formulation
Reduced communication rounds via objective similarities