🤖 AI Summary
To address the challenge of unified modeling for multi-dimensional bottlenecks—computation, memory, and network—in distributed systems, this paper proposes Ridgeline, the first two-dimensional Roofline performance modeling framework tailored for distributed scenarios. Ridgeline extends the classical Roofline model by incorporating network bandwidth as a core dimension, establishing a dual-axis coordinate system spanned by operational intensity and communication intensity. This enables unified characterization of all three resource constraints and precise identification of the dominant bottleneck. By generalizing Roofline boundary analysis to account for communication overhead, Ridgeline supports communication-aware prediction of multi-node performance ceilings. Evaluated on data-parallel MLP training, it accurately distinguishes communication-bound from compute-bound regimes and successfully predicts performance scaling inflection points across nodes. Ridgeline thus provides a principled, interpretable, and quantifiable theoretical tool for performance diagnosis and optimization in distributed AI systems.
📝 Abstract
In this short paper we introduce the Ridgeline model, an extension of the Roofline model [4] for distributed systems. The Roofline model targets shared memory systems, bounding the performance of a kernel based on its operational intensity, and the peak compute throughput and memory bandwidth of the execution system. In a distributed setting, with multiple communicating compute entities, the network must be taken into account to model the system behavior accurately. The Ridgeline aggregates information on compute, memory, and network limits in one 2D plot to show, in an intuitive way, which of the resources is the expected bottleneck. We show the applicability of the Ridgeline on a case study based on a data-parallel Multi-Layer Perceptron (MLP) instance.
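The abstract describes the core idea: performance is capped by three ceilings, one per resource, selected by a kernel's operational intensity (FLOP per byte of memory traffic) and communication intensity (FLOP per byte of network traffic). The paper's exact formulation is not reproduced here, but a minimal sketch, assuming the bound generalizes Roofline's `min` over ceilings to a third, network term (the function name and parameters are illustrative, not from the paper):

```python
def ridgeline_bound(peak_flops, mem_bw, net_bw, oi, ci):
    """Illustrative Ridgeline-style attainable throughput (FLOP/s).

    peak_flops: peak compute throughput of one node (FLOP/s)
    mem_bw:     peak memory bandwidth (bytes/s)
    net_bw:     peak network bandwidth (bytes/s)
    oi:         operational intensity (FLOP per byte to/from memory)
    ci:         communication intensity (FLOP per byte over the network)

    Each resource yields a ceiling; the lowest one is the expected
    bottleneck, mirroring the classical Roofline min() over ceilings.
    """
    compute_ceiling = peak_flops
    memory_ceiling = oi * mem_bw
    network_ceiling = ci * net_bw
    return min(compute_ceiling, memory_ceiling, network_ceiling)


# Hypothetical numbers: 10 TFLOP/s peak, 900 GB/s memory, 12.5 GB/s network.
# With oi=10 FLOP/B and ci=100 FLOP/B, the network ceiling (1.25 TFLOP/s)
# is lowest, so the kernel is expected to be communication-bound.
bound = ridgeline_bound(10e12, 900e9, 12.5e9, oi=10, ci=100)
```

In this sketch, a kernel whose `ci` is low (much data exchanged per FLOP, as in data-parallel gradient all-reduce) hits the network ceiling first, which is exactly the communication-bound regime the paper's MLP case study identifies.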