🤖 AI Summary
To address the communication-bandwidth bottleneck that limits the scalability of distributed CNN training on multi-node GPU clusters, this paper proposes a co-optimization paradigm that integrates structured pruning with the cluster topology. Methodologically, it introduces (1) a hierarchical structured ADMM (H-SADMM) algorithm that enforces node-level structural sparsity; (2) a leader-follower execution model that decouples intra- and inter-node communication while enabling dynamic tensor compression and the elimination of zero-valued transmissions; and (3) dual-granularity process groups coupled with customized dense reduction primitives. Evaluated on a 64-GPU cluster, the approach reduces cross-node communication volume by 60%, achieves a strong-scaling speedup of 6.75×, and significantly outperforms both the dense baseline (5.81×) and Top-K compression (3.71×).
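The paper's exact H-SADMM formulation is not reproduced in this summary, but the name suggests the standard structured-ADMM pruning recipe: alternate a gradient step on the augmented Lagrangian with a projection onto a channel-sparse set, followed by a scaled dual update. The Python sketch below illustrates that generic recipe; `project_channels`, `rho`, `lr`, and `k` are illustrative names and hyperparameters, not taken from the paper.

```python
import torch

def project_channels(t: torch.Tensor, k: int) -> torch.Tensor:
    """Project onto the set of tensors with at most k nonzero output channels,
    keeping the k channels with the largest L2 norm."""
    norms = t.flatten(1).norm(dim=1)     # per-output-channel L2 norm
    keep = norms.topk(k).indices
    z = torch.zeros_like(t)
    z[keep] = t[keep]
    return z

def admm_step(w, z, u, grad_loss, lr=0.01, rho=1e-3, k=32):
    """One structured-ADMM iteration on a weight tensor of shape
    (out_channels, ...): primal gradient step, sparse projection, dual update.
    Hypothetical hyperparameters; the paper's H-SADMM may differ."""
    w = w - lr * (grad_loss + rho * (w - z + u))  # descend the augmented Lagrangian
    z = project_channels(w + u, k)                # auxiliary, channel-sparse copy
    u = u + (w - z)                               # scaled dual variable
    return w, z, u
```

Because the projection zeroes whole output channels rather than individual weights, the surviving channels can be packed into a contiguous dense buffer, which is what makes the communication compaction described below possible.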
📝 Abstract
Inter-node communication bandwidth increasingly constrains distributed training at scale on multi-node GPU clusters. While compact models are the ultimate deployment target, conventional pruning-aware distributed training systems typically fail to reduce communication overhead, because unstructured sparsity cannot be exploited efficiently by highly optimized dense collective primitives. We present PruneX, a distributed data-parallel training system that co-designs the pruning algorithm with the cluster hierarchy to reduce inter-node bandwidth usage. PruneX introduces the Hierarchical Structured ADMM (H-SADMM) algorithm, which enforces node-level structured sparsity before inter-node synchronization, enabling dynamic buffer compaction that eliminates both zero-valued transmissions and indexing overhead. The system adopts a leader-follower execution model with separate intra-node and inter-node process groups, performing dense collectives on compacted tensors over bandwidth-limited links while confining full synchronization to high-bandwidth intra-node interconnects. Evaluation on ResNet architectures across 64 GPUs demonstrates that PruneX reduces inter-node communication volume by approximately 60% and achieves a 6.75× strong-scaling speedup, outperforming the dense baseline (5.81×) and Top-K gradient compression (3.71×) on the Puhti supercomputer at CSC - IT Center for Science (Finland).
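As one concrete reading of the leader-follower exchange described above, the sketch below builds one intra-node process group per node plus an inter-node group of node leaders, and compacts the gradient to its unpruned channels before the cross-node all-reduce. All names (`GPUS_PER_NODE`, `hierarchical_allreduce`, `keep_idx`) are hypothetical; PruneX's actual group layout and customized reduction primitives may differ.

```python
import torch
import torch.distributed as dist

GPUS_PER_NODE = 8  # assumption: 8 GPUs per node on the 64-GPU cluster

def build_groups(world_size: int, gpus_per_node: int):
    """One intra-node group per node, plus one inter-node group of leaders.
    Every rank must call new_group() for every group, even ones it is not in."""
    rank = dist.get_rank()
    intra_group = None
    for node in range(world_size // gpus_per_node):
        ranks = list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        group = dist.new_group(ranks)
        if rank in ranks:
            intra_group = group
    inter_group = dist.new_group(list(range(0, world_size, gpus_per_node)))
    return intra_group, inter_group

def hierarchical_allreduce(grad, keep_idx, intra_group, inter_group):
    """Leader-follower gradient synchronization: full-size reduction inside the
    node, dense all-reduce on the compacted tensor across nodes, broadcast back."""
    rank = dist.get_rank()
    leader = (rank // GPUS_PER_NODE) * GPUS_PER_NODE
    # 1) sum full gradients over the fast intra-node interconnect
    dist.reduce(grad, dst=leader, group=intra_group)
    if rank == leader:
        # 2) compact: node-level structured sparsity means the pruned channels
        #    are zero on every node, so only the kept channels need to travel
        compact = grad.index_select(0, keep_idx).contiguous()
        # 3) plain dense all-reduce on the small tensor over the slow links
        dist.all_reduce(compact, group=inter_group)
        grad.zero_()
        grad.index_copy_(0, keep_idx, compact)  # scatter back into place
        grad /= dist.get_world_size()           # average over all ranks
    # 4) followers receive the synchronized gradient from their node leader
    dist.broadcast(grad, src=leader, group=intra_group)
```

Because the exchanged tensor is dense and contiguous, the inter-node step can reuse stock dense collectives instead of sparse or index-carrying formats, which is the stated source of the roughly 60% bandwidth saving.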