A Study on Distributed Strategies for Deep Learning Applications in GPU Clusters

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models face severe memory bottlenecks and scalability limits when trained on GPU clusters. Method: This work systematically compares three distributed training paradigms, Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), and Parameter Server (PS), across multiple models and real-world datasets, quantitatively evaluating trade-offs among GPU memory consumption, training latency, GPU utilization, and model accuracy. Contribution/Results: FSDP reduces peak GPU memory by over 60% but incurs up to 6× higher training latency; asynchronous PS training improves throughput yet degrades accuracy by up to 3.2%. Crucially, we establish a strong inverse correlation between memory savings and training latency and identify asynchronous parameter updates as a primary source of accuracy degradation. Based on these empirical insights, we propose a principled, resource- and objective-aware strategy-selection framework for distributed training, providing both empirical evidence and a practical deployment guide for large-scale deep learning systems.
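The summary attributes PS accuracy loss to stale asynchronous updates. A minimal stdlib-only sketch (illustrative, not the paper's code or parameters) shows the mechanism on a 1-D quadratic loss: synchronous steps use the current gradient, while "async" steps use a gradient computed from weights several steps old, mimicking delayed pushes to a parameter server.

```python
# Toy illustration of update staleness (all values are assumptions for
# the sketch, not results from the paper). We minimize (w - 3)^2.

def grad(w):
    return 2.0 * (w - 3.0)  # gradient of (w - 3)^2

def train_sync(steps=50, lr=0.1):
    """Synchronous (DDP-style): every update uses the current weights."""
    w, trajectory = 0.0, []
    for _ in range(steps):
        w -= lr * grad(w)
        trajectory.append(w)
    return trajectory

def train_async(steps=50, lr=0.1, staleness=3):
    """Asynchronous (PS-style): each update uses weights that are up to
    `staleness` steps old, as if pushed late to a parameter server."""
    w, history = 0.0, [0.0]
    for _ in range(steps):
        stale_w = history[max(0, len(history) - 1 - staleness)]
        w -= lr * grad(stale_w)
        history.append(w)
    return history[1:]

sync_traj = train_sync()
async_traj = train_async()
# Synchronous updates approach the optimum w* = 3 monotonically from
# below; stale updates overshoot past 3 and oscillate before settling.
print(max(sync_traj), max(async_traj))
```

In this toy setting the stale gradients keep pointing "uphill" after the optimum has been passed, producing the overshoot and oscillation that, in real training, surface as the accuracy degradation the paper measures.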

📝 Abstract
As deep learning models grow in size and complexity, training them efficiently on a single GPU becomes increasingly infeasible. This study investigates the effectiveness of several distributed training strategies for scalable deep learning on GPU clusters: Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), and the Parameter Server (PS) model. We conduct empirical evaluations across multiple models and datasets to assess trade-offs in memory usage, training time, GPU utilization, and model accuracy. Our results show that while FSDP reduces GPU memory usage by over 60%, it increases training time by up to 6× compared to DDP. In contrast, asynchronous PS training improves throughput but can degrade accuracy due to stale updates. Through comprehensive analysis, we provide practical insights into the strengths and limitations of each strategy, offering guidance for selecting suitable methods based on system constraints and training objectives.
Problem

Research questions and friction points this paper is trying to address.

Evaluating distributed training strategies for scalable deep learning on GPU clusters
Quantifying trade-offs among memory usage, training time, GPU utilization, and accuracy
Guiding strategy selection based on system constraints and training objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed Data Parallel (DDP) as a strong baseline for training speed
Fully Sharded Data Parallel (FSDP) cuts peak GPU memory by over 60%
Asynchronous Parameter Server (PS) updates raise throughput at a cost in accuracy
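The FSDP memory saving claimed above can be made concrete with back-of-envelope accounting. The sketch below (stdlib only; the model size, GPU count, and fp32/Adam assumptions are illustrative, not from the paper) compares per-GPU state when parameters, gradients, and Adam optimizer moments are replicated on every GPU (DDP) versus sharded evenly across GPUs (FSDP/ZeRO-3-style):

```python
# Rough per-GPU memory accounting: replicated (DDP) vs fully sharded (FSDP).
# Assumes fp32 everywhere and Adam (two moment tensors), ignoring
# activations and communication buffers for simplicity.

def per_gpu_bytes(n_params, n_gpus, sharded, bytes_per_param=4):
    # params + grads + Adam m and v = 4 tensors of model size
    total = 4 * n_params * bytes_per_param
    return total / n_gpus if sharded else total

N = 7_000_000_000  # hypothetical 7B-parameter model
ddp = per_gpu_bytes(N, 8, sharded=False)
fsdp = per_gpu_bytes(N, 8, sharded=True)
print(f"DDP : {ddp / 2**30:.1f} GiB per GPU")
print(f"FSDP: {fsdp / 2**30:.1f} GiB per GPU")
```

Sharding divides this state by the GPU count, which is where FSDP's large memory reduction comes from; the latency overhead the paper reports is the price of gathering shards on demand during forward and backward passes.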