Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

📅 2025-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high communication overhead and poor scalability of distributed SGD on large-scale clusters, this paper proposes HybridSGD, a two-dimensional (2D) parallel SGD algorithm that unifies and generalizes 1D s-step SGD and 1D FedAvg into a single 2D parallel framework. Methodologically, HybridSGD combines block-wise data partitioning with hierarchical synchronization, implemented efficiently in C++/MPI. Theoretically, it establishes a joint trade-off model characterizing convergence rate against communication, computation, and memory costs. Experiments on a Cray EX supercomputer demonstrate that HybridSGD achieves speedups of up to 5.3× over s-step SGD and up to 121× over FedAvg, while attaining better convergence than FedAvg at similar processor scales on LIBSVM binary classification tasks. Its core contribution is a 2D SGD framework that exposes a continuous, fine-grained performance trade-off between the two 1D baselines, balancing computational efficiency, scalability, and convergence behavior.
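The hybrid structure described above can be illustrated with a small serial simulation. This is a sketch of the communication pattern only, not the paper's C++/MPI implementation: it assumes the 2D grid collapses to `groups` model replicas, each running `s` local SGD steps on its own data shard (the s-step dimension) before all replicas are averaged (the FedAvg dimension). The strided sharding scheme and hyperparameters here are illustrative assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// One SGD step of logistic regression on a single sample (x, y), y in {0, 1}.
void sgd_step(std::vector<double>& w, const std::vector<double>& x, double y,
              double lr) {
    double z = 0.0;
    for (std::size_t j = 0; j < w.size(); ++j) z += w[j] * x[j];
    double g = sigmoid(z) - y;  // dLoss/dz for the log-loss
    for (std::size_t j = 0; j < w.size(); ++j) w[j] -= lr * g * x[j];
}

// Serial simulation of a HybridSGD-style schedule: `groups` replicas each
// take `s` local steps per round, then all replicas are averaged (this
// averaging stands in for one inter-group allreduce per round).
std::vector<double> hybrid_sgd(const std::vector<std::vector<double>>& X,
                               const std::vector<double>& Y, int groups, int s,
                               int rounds, double lr) {
    std::size_t d = X[0].size(), n = X.size();
    std::vector<std::vector<double>> models(groups,
                                            std::vector<double>(d, 0.0));
    for (int r = 0; r < rounds; ++r) {
        // Local phase: s steps per group on its shard, no inter-group traffic.
        for (int g = 0; g < groups; ++g)
            for (int t = 0; t < s; ++t) {
                std::size_t i =
                    (g + static_cast<std::size_t>(t) * groups) % n;  // strided shard
                sgd_step(models[g], X[i], Y[i], lr);
            }
        // Synchronization phase: average the replicas across groups.
        std::vector<double> avg(d, 0.0);
        for (int g = 0; g < groups; ++g)
            for (std::size_t j = 0; j < d; ++j) avg[j] += models[g][j] / groups;
        for (int g = 0; g < groups; ++g) models[g] = avg;
    }
    return models[0];
}
```

Setting `groups = 1` recovers an s-step-SGD-like schedule (one replica, synchronization every `s` steps), while a large `s` with one group per processor recovers a FedAvg-like schedule, which is the continuous trade-off the paper's 2D design targets.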

📝 Abstract
Distributed-memory implementations of numerical optimization algorithms, such as stochastic gradient descent (SGD), require interprocessor communication at every iteration of the algorithm. On modern distributed-memory clusters where communication is more expensive than computation, the scalability and performance of these algorithms are limited by communication cost. This work generalizes prior work on 1D $s$-step SGD and 1D Federated SGD with Averaging (FedAvg) to yield a 2D parallel SGD method (HybridSGD) which attains a continuous performance trade-off between the two baseline algorithms. We present theoretical analysis which shows the convergence, computation, communication, and memory trade-offs between $s$-step SGD, FedAvg, 2D parallel SGD, and other parallel SGD variants. We implement all algorithms in C++ and MPI and evaluate their performance on a Cray EX supercomputing system. Our empirical results show that HybridSGD achieves better convergence than FedAvg at similar processor scales while attaining speedups of $5.3\times$ over $s$-step SGD and speedups up to $121\times$ over FedAvg when used to solve binary classification tasks using the convex, logistic regression model on datasets obtained from the LIBSVM repository.
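For reference, the convex binary-classification objective mentioned in the abstract is standard logistic regression. The $\ell_2$ regularization term below is a common choice but an assumption here, not a detail taken from the paper:

```latex
\min_{w \in \mathbb{R}^d} \; f(w) =
  \frac{1}{n} \sum_{i=1}^{n} \log\bigl(1 + \exp(-y_i\, w^\top x_i)\bigr)
  + \frac{\lambda}{2} \lVert w \rVert_2^2,
\qquad y_i \in \{-1, +1\},
```

with stochastic gradients drawn from the per-sample terms

```latex
\nabla f_i(w) = -\frac{y_i\, x_i}{1 + \exp(y_i\, w^\top x_i)} + \lambda w,
```

so each SGD iteration (and hence each communication round in the distributed setting) operates on gradients of this form.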
Problem

Research questions and friction points this paper is trying to address.

Distributed Computing
Communication Efficiency
Scalability Improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

HybridSGD
Efficiency
Scalability