Understanding and Detecting Scalability Faults in Large-Scale Distributed Systems

📅 2026-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Scalability faults in large-scale distributed systems are often latent: they are hard to uncover during development and tend to manifest only at large-scale deployment. This work presents what the authors describe as the first comprehensive study of scalability faults, covering 444 scalability issue reports from 10 large-scale distributed systems. The study finds that most of these faults arise from the interplay between dimensional code fragments and the anti-patterns associated with them. Building on this finding, the authors design ScaleLens, which combines dynamic and static analysis to pinpoint dimensional code fragments and match them against anti-patterns. In the evaluation, ScaleLens detects 4.2x more dimensional code fragments associated with known scalability faults than the baseline, and it finds 334 dimensional code fragments with confirmed problematic behavior in the latest stable versions of Cassandra, HDFS, and Ignite.
📝 Abstract
Scalable distributed systems form the backbone of modern computing infrastructure. However, as scale grows, system complexity may lead to scalability faults. Scalability faults are challenging to uncover and diagnose, as they are often latent and only manifest at large-scale deployment. In this paper, we present the first comprehensive study on scalability faults and propose an approach for their detection. First, we systematically investigate 444 scalability issue reports from 10 large-scale distributed systems to understand the common anti-patterns and root causes of scalability faults. We found that the majority of these faults are caused by the synergy between dimensional code fragments and anti-patterns associated with them. Second, based on our findings, we design and implement ScaleLens, a novel approach to detect scalability faults. ScaleLens combines dynamic and static analyses to pinpoint dimensional code fragments and match them with anti-patterns. Our evaluation shows that ScaleLens detects 4.2x more dimensional code fragments associated with known scalability faults compared to the baseline. On the latest stable versions of Cassandra, HDFS, and Ignite, ScaleLens detects 334 dimensional code fragments with confirmed problematic behavior.
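The abstract does not spell out how a dimensional code fragment is recognized, so the following is only a toy sketch of the general idea, not ScaleLens itself. It assumes that "dimensional" means code whose amount of work depends on a scale dimension such as the number of nodes, and it uses a nested rescan of the node list as a stand-in anti-pattern. All function names, the scale values, and the 1.5 threshold are hypothetical choices made for illustration.

```python
import math


def rebuild_membership_naive(nodes):
    """Hypothetical fragment: for every node, rescan the whole node list."""
    ops = 0
    view = {}
    for node in nodes:          # outer loop bound grows with cluster size
        peers = []
        for other in nodes:     # nested scan over the same scale dimension
            ops += 1
            if other != node:
                peers.append(other)
        view[node] = peers
    return view, ops


def rebuild_membership_linear(nodes):
    """Hypothetical alternative: touch each node once."""
    ops = 0
    view = {}
    for node in nodes:
        ops += 1
        view[node] = len(nodes) - 1  # store only the peer count
    return view, ops


def growth_exponent(fragment, small, large):
    """Estimate k in ops ~ n**k by running the fragment at two scales."""
    _, ops_small = fragment([f"n{i}" for i in range(small)])
    _, ops_large = fragment([f"n{i}" for i in range(large)])
    return math.log(ops_large / ops_small) / math.log(large / small)


def is_suspect(fragment, small=8, large=64, threshold=1.5):
    """Flag fragments whose work grows faster than the assumed threshold."""
    return growth_exponent(fragment, small, large) >= threshold


if __name__ == "__main__":
    for fragment in (rebuild_membership_naive, rebuild_membership_linear):
        print(fragment.__name__, is_suspect(fragment))
```

Counting operations at two small scales stands in for the dynamic side of the analysis. A real tool would also need a static pass to locate which loops and data structures depend on the scale dimension in the first place, which this sketch does not attempt.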
Problem

Research questions and friction points this paper is trying to address.

scalability faults
distributed systems
large-scale
fault detection
dimensional code fragments
Innovation

Methods, ideas, or system contributions that make the work stand out.

scalability faults
dimensional code fragments
anti-patterns
static and dynamic analysis
ScaleLens