🤖 AI Summary
This work addresses the challenge of performance diagnosis in distributed training, where synchronization operations obscure true bottlenecks, lightweight monitoring often yields misleading conclusions, and heavyweight profilers incur prohibitive overhead. The authors propose a lightweight, always-on monitoring mechanism that collects coarse-grained phase durations—such as data loading, forward, and backward passes—from each node. By leveraging a novel frontier-based time accumulation algorithm, the method enables precise, additive global latency attribution without requiring clock synchronization or kernel-level tracing. It accurately identifies the first slow node and phase causing intra-group waiting and pinpoints time windows warranting deeper analysis. Evaluated on a 128-node cluster, the approach incurs less than 0.2% throughput overhead, ranks injected faults within the top two in all 50 experiments, and reproduces heavyweight profiler results using only a 0.11 MB summary—dramatically reducing storage costs compared to 15.81 GB of full trace data.
📝 Abstract
When a distributed training job slows down, the hard part is knowing where to look. Synchronization hides the cause: a stall on one rank shows up as a wait on the others, so a data delay on a single rank can surface as backward time across the group. The cheap dashboards that run all the time -- per-stage averages and maxima -- misread this, double-counting the same exposed delay or burying the slow rank in an average, while full profilers see it clearly but are far too heavy to leave on.
StageFrontier is an always-on signal that closes this gap. Each rank reports only a short ordered vector of coarse stage durations -- data, forward, backward, and so on -- timed with CPU wall-clock, with no synchronized clocks and no kernel tracing. At each stage boundary, StageFrontier takes the cumulative time of whichever rank is furthest along; the increments of this frontier form an exact, additive accounting of the step's exposed time and point to the stage and rank where group-visible delay first appears, telling an operator where to aim a heavy profiler, not which fix to make. The accounting is exact, but the coarse signal alone cannot tell whether a leading stage truly caused the slowdown or merely ran alongside it; StageFrontier labels the windows where that distinction needs more evidence instead of guessing.
A PyTorch implementation adds under 0.2% throughput overhead through 128 ranks on Gloo and NCCL, places injected faults among its top two suspects on all 50 rows of a hidden-rank DDP test, and recovers the same top-stage routing as PyTorch Profiler, HTA, and Nsight Systems once their traces are reduced to the same coarse stages -- from a 0.11 MB summary instead of a 15.81 GB trace.