🤖 AI Summary
To address the challenge of unified modeling for multi-dimensional bottlenecks—computation, memory, and network—in distributed systems, this paper proposes Ridgeline, the first two-dimensional Roofline performance modeling framework tailored for distributed scenarios. Ridgeline extends the classical Roofline model by incorporating network bandwidth as a core dimension, establishing a dual-axis coordinate system spanned by operational intensity and communication intensity. This enables unified characterization of all three resource constraints and precise identification of the dominant bottleneck. By generalizing Roofline boundary analysis to account for communication overhead, Ridgeline supports communication-aware prediction of multi-node performance ceilings. Evaluated on data-parallel MLP training, it accurately distinguishes communication-bound from compute-bound regimes and successfully predicts performance scaling inflection points across nodes. Ridgeline thus provides a principled, interpretable, and quantifiable theoretical tool for performance diagnosis and optimization in distributed AI systems.
📝 Abstract
In this short paper we introduce the Ridgeline model, an extension of the Roofline model [4] for distributed systems. The Roofline model targets shared memory systems, bounding the performance of a kernel based on its operational intensity, and the peak compute throughput and memory bandwidth of the execution system. In a distributed setting, with multiple communicating compute entities, the network must be taken into account to model the system behavior accurately. The Ridgeline aggregates information on compute, memory, and network limits in one 2D plot to show, in an intuitive way, which of the resources is the expected bottleneck. We show the applicability of the Ridgeline on a case study based on a data-parallel Multi-Layer Perceptron (MLP) instance.
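The abstract describes the core idea: performance is capped by three ceilings, one per resource, selected by a kernel's operational intensity (FLOP per byte of memory traffic) and communication intensity (FLOP per byte of network traffic). The paper's exact formulation is not reproduced here, but a minimal sketch, assuming the bound generalizes Roofline's `min` over ceilings to a third, network term (the function name and parameters are illustrative, not from the paper):

```python
def ridgeline_bound(peak_flops, mem_bw, net_bw, oi, ci):
    """Illustrative Ridgeline-style attainable throughput (FLOP/s).

    peak_flops: peak compute throughput of one node (FLOP/s)
    mem_bw:     peak memory bandwidth (bytes/s)
    net_bw:     peak network bandwidth (bytes/s)
    oi:         operational intensity (FLOP per byte to/from memory)
    ci:         communication intensity (FLOP per byte over the network)

    Each resource yields a ceiling; the lowest one is the expected
    bottleneck, mirroring the classical Roofline min() over ceilings.
    """
    compute_ceiling = peak_flops
    memory_ceiling = oi * mem_bw
    network_ceiling = ci * net_bw
    return min(compute_ceiling, memory_ceiling, network_ceiling)


# Hypothetical numbers: 10 TFLOP/s peak, 900 GB/s memory, 12.5 GB/s network.
# With oi=10 FLOP/B and ci=100 FLOP/B, the network ceiling (1.25 TFLOP/s)
# is lowest, so the kernel is expected to be communication-bound.
bound = ridgeline_bound(10e12, 900e9, 12.5e9, oi=10, ci=100)
```

In this sketch, a kernel whose `ci` is low (much data exchanged per FLOP, as in data-parallel gradient all-reduce) hits the network ceiling first, which is exactly the communication-bound regime the paper's MLP case study identifies.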