Ridgeline: A 2D Roofline Model for Distributed Systems

📅 2022-09-03
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
Distributed systems face three potential bottlenecks (computation, memory, and network) that the classical Roofline model does not capture in a single view. This paper proposes Ridgeline, a two-dimensional extension of the Roofline model for distributed settings. Ridgeline adds network bandwidth as a core dimension, establishing a dual-axis coordinate system spanned by operational intensity and communication intensity. A single plot then characterizes all three resource constraints and indicates which one is the expected bottleneck. By generalizing Roofline boundary analysis to account for communication overhead, Ridgeline supports communication-aware estimation of multi-node performance ceilings. The authors demonstrate the model on a case study of data-parallel MLP training, where it distinguishes communication-bound from compute-bound regimes and locates the points at which scaling across nodes shifts between them. Ridgeline thus offers an interpretable, quantitative tool for performance diagnosis and optimization in distributed AI systems.
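The bottleneck identification described above can be illustrated with a minimal sketch. It assumes the simplest possible cost model, in which compute, memory traffic, and network traffic overlap perfectly, so execution time is bounded below by the slowest of the three. This is an illustrative assumption, not necessarily the paper's exact formulation. The names `Machine`, `Kernel`, and `bottleneck`, and all numeric values, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    peak_flops: float  # peak compute throughput per node, FLOP/s
    mem_bw: float      # memory bandwidth per node, bytes/s
    net_bw: float      # network bandwidth per node, bytes/s


@dataclass(frozen=True)
class Kernel:
    flops: float      # work executed per node, FLOP
    mem_bytes: float  # bytes moved to/from memory per node
    net_bytes: float  # bytes sent over the network per node


def bottleneck(k: Kernel, m: Machine) -> tuple[str, float]:
    """Return the limiting resource and the attainable FLOP/s.

    Assumes perfect overlap: time >= max of the three per-resource times.
    """
    times = {
        "compute": k.flops / m.peak_flops,
        "memory": k.mem_bytes / m.mem_bw,
        "network": k.net_bytes / m.net_bw,
    }
    limiter = max(times, key=times.get)
    return limiter, k.flops / times[limiter]


# Hypothetical machine: 10 TFLOP/s, 1 TB/s memory, 25 GB/s network.
machine = Machine(peak_flops=10e12, mem_bw=1e12, net_bw=25e9)
# Hypothetical kernel: 1 TFLOP of work, 20 GB memory traffic, 5 GB network traffic.
kernel = Kernel(flops=1e12, mem_bytes=20e9, net_bytes=5e9)

limiter, attainable = bottleneck(kernel, machine)
print(limiter, attainable)
```

With these made-up numbers the network time (0.2 s) exceeds both the compute time (0.1 s) and the memory time (0.02 s), so the sketch reports the network as the limiter. Taking the maximum of the three times is equivalent to taking the minimum of the peak compute rate and the two bandwidth-times-intensity products, which is the familiar Roofline form with one extra term.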
📝 Abstract
In this short paper we introduce the Ridgeline model, an extension of the Roofline model [4] for distributed systems. The Roofline model targets shared memory systems, bounding the performance of a kernel based on its operational intensity, and the peak compute throughput and memory bandwidth of the execution system. In a distributed setting, with multiple communicating compute entities, the network must be taken into account to model the system behavior accurately. The Ridgeline aggregates information on compute, memory, and network limits in one 2D plot to show, in an intuitive way, which of the resources is the expected bottleneck. We show the applicability of the Ridgeline on a case study based on a data-parallel Multi-Layer Perceptron (MLP) instance.
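To make the data-parallel MLP case study more concrete, the following sketch estimates per-node compute and network time for one training step as the node count grows. It is not the paper's experiment. The layer sizes, batch size, and machine parameters are hypothetical. The FLOP count uses the common approximation of 6 FLOP per weight per sample for a forward and backward pass, and the gradient exchange is assumed to be a ring all-reduce in which each node sends 2(p-1)/p times the gradient size. Memory traffic is omitted to keep the example short.

```python
def mlp_step_times(layers, global_batch, nodes, peak_flops, net_bw,
                   bytes_per_param=4):
    """Estimate (compute_s, network_s) per node for one data-parallel step.

    layers: list of (in_features, out_features), one pair per dense layer.
    """
    params = sum(i * o for i, o in layers)
    local_batch = global_batch / nodes
    # ~2 FLOP per weight per sample forward, ~4 backward (approximation).
    flops = 6 * params * local_batch
    # Assumed ring all-reduce: each node sends 2*(p-1)/p of the gradient bytes.
    net_bytes = 2 * (nodes - 1) / nodes * params * bytes_per_param
    return flops / peak_flops, net_bytes / net_bw


layers = [(4096, 4096)] * 4  # hypothetical 4-layer MLP
for p in (1, 2, 4, 8, 16, 32, 64):
    t_comp, t_net = mlp_step_times(layers, global_batch=8192, nodes=p,
                                   peak_flops=10e12, net_bw=25e9)
    regime = "compute" if t_comp >= t_net else "network"
    print(f"{p:3d} nodes  compute {t_comp * 1e3:8.3f} ms  "
          f"network {t_net * 1e3:8.3f} ms  {regime}-bound")
```

Under these assumptions the per-node compute time shrinks as nodes are added while the all-reduce volume stays roughly constant, so the estimate moves from compute-bound to network-bound between 16 and 32 nodes. A different interconnect, batch size, or collective algorithm would move that crossover.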
Problem

Research questions and friction points this paper is trying to address.

Extends Roofline model to distributed systems
Accounts for network impact on system performance
Identifies compute, memory, and network bottlenecks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends Roofline model for distributed systems
Aggregates compute, memory, network limits visually
Identifies performance bottlenecks in distributed computing