HetCCL: Enabling Collective Communication For Mixed-Vendor Heterogeneous Clusters

📅 2026-05-29

🤖 AI Summary
Training large language models on heterogeneous clusters with mixed-vendor hardware is hindered by existing collective communication frameworks, which either do not support mixed hardware or incur heavy data-path overhead, resulting in high communication cost and low bandwidth utilization. This work proposes HetCCL, a communication framework that eliminates host-device memory copies through heterogeneous point-to-point transfers while offloading control to the CPUs. It introduces a border-communicator mechanism to enable vendor-independent reduction operations and employs a hierarchical topology abstraction to optimize cross-cluster data movement. Implemented with support for four vendors, HetCCL achieves 17–19× higher communication bandwidth than Gloo across four heterogeneous configurations and reduces end-to-end training step time by up to 16.9%.
📝 Abstract
Training Large Language Models (LLMs) on heterogeneous clusters presents significant challenges for collective communication, as hardware from multiple vendors introduces diverse network and computational characteristics. Existing collective communication frameworks designed for homogeneous environments (e.g., NCCL, RCCL) do not support mixed-hardware setups, while communication libraries with heterogeneous support (e.g., Gloo, OpenMPI) incur heavy overhead in the data path. This paper presents HetCCL, a framework that enables heterogeneous collective communication through efficient P2P transport across heterogeneous devices (e.g., GPUs), eliminating host-device memory copy overhead while offloading control to the CPUs. For combining collectives (e.g., AllReduce, ReduceScatter), HetCCL introduces a border-communicator mechanism that achieves vendor independence by reusing the reductions built into the combining collectives of each vendor's collective communication library. Building on the efficient heterogeneous P2P transport and the portable reduction mechanism, HetCCL proposes a hierarchical topology abstraction for heterogeneous clusters, dissecting collective communication into cluster-level primitives that guarantee optimal cross-cluster data transfer volume and optimal bandwidth utilization. We implement HetCCL with support for 4 different vendors and evaluate it in 4 heterogeneous settings with benchmarks and end-to-end LLM tasks. Our evaluation shows that HetCCL achieves 17-19x higher bandwidth than Gloo in heterogeneous communications and speeds up end-to-end training by up to 16.9% in per-step time.
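The idea of dissecting a collective into cluster-level primitives can be illustrated with a generic hierarchical AllReduce. The sketch below is not HetCCL's implementation, since the abstract does not specify its primitives. It is a pure-Python simulation under stated assumptions: every cluster has the same number of devices, plain lists stand in for device buffers, and the names `split_ranges` and `hierarchical_allreduce` are hypothetical. Step 1 stands for a ReduceScatter that each vendor library would run inside its own cluster, step 2 for the cross-cluster exchange between matching devices, and step 3 for an intra-cluster AllGather.

```python
from typing import List, Tuple

Vector = List[float]


def split_ranges(n: int, parts: int) -> List[range]:
    """Split range(n) into `parts` contiguous, near-equal index ranges."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def hierarchical_allreduce(
    clusters: List[List[Vector]],
) -> Tuple[List[List[Vector]], int]:
    """Simulate a three-step hierarchical AllReduce (sum).

    clusters[c][d] is the buffer of device d in cluster c.
    Assumption (for illustration only): equal device count per cluster.
    Returns the reduced buffers and the number of elements that crossed
    a cluster boundary in this sketch's all-to-all shard exchange.
    """
    num_clusters = len(clusters)
    k = len(clusters[0])
    n = len(clusters[0][0])
    assert all(len(c) == k for c in clusters), "equal device counts assumed"
    assert all(len(buf) == n for c in clusters for buf in c)
    shards = split_ranges(n, k)

    # Step 1: intra-cluster ReduceScatter. Device i of each cluster ends up
    # holding shard i of that cluster's local sum.
    local = [
        [[sum(dev[j] for dev in cluster) for j in shards[i]] for i in range(k)]
        for cluster in clusters
    ]

    # Step 2: cross-cluster exchange. Device i of every cluster sends its
    # shard to device i of every other cluster, and the shards are summed.
    cross_elems = 0
    global_shards = []
    for i in range(k):
        width = len(shards[i])
        global_shards.append(
            [sum(local[c][i][j] for c in range(num_clusters)) for j in range(width)]
        )
        cross_elems += width * num_clusters * (num_clusters - 1)

    # Step 3: intra-cluster AllGather. Every device receives all shards.
    full = [x for shard in global_shards for x in shard]
    result = [[list(full) for _ in range(k)] for _ in clusters]
    return result, cross_elems


if __name__ == "__main__":
    cluster_a = [[1, 2, 3, 4], [10, 20, 30, 40]]
    cluster_b = [[100, 200, 300, 400], [1000, 2000, 3000, 4000]]
    reduced, crossed = hierarchical_allreduce([cluster_a, cluster_b])
    print(reduced[0][0])  # [1111, 2222, 3333, 4444]
    print(crossed)        # 8
```

In this sketch with two clusters, the number of elements crossing the cluster boundary is twice the buffer length and does not grow with the number of devices per cluster, because each device forwards only its own shard of the locally reduced data.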
Problem

Research questions and friction points this paper is trying to address.

heterogeneous clusters
collective communication
mixed-vendor
large language models
communication overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous collective communication
P2P transport
border-communicator
vendor independence
hierarchical topology abstraction