🤖 AI Summary
This work addresses a limitation of prior approaches, which characterize metastable failures as self-sustaining overload and therefore cannot explain metastable failures that do not manifest as overload. It presents the first causal characterization of metastable failures, tracing them to metastable faults: destabilizing cycles of interaction among system components that are individually stabilizing. A failure arises when scheduling decisions let these destabilizing interactions overpower the components' stabilizing tendencies. From this characterization, the study derives a methodology for predicting metastable failures and for building metastable-fault-tolerant (MFT) systems. The methodology is applied to three case studies, which showcase the generality of the results.
📝 Abstract
Metastable failures are hard to detect, prevent, and mitigate. During a metastable failure, a system exhibits self-sustaining bad behavior even in the absence of adversarial conditions. Prior work has focused on symptoms and portrayed metastable failures as instances of self-sustaining overload. This characterization leaves the underlying failure causes and dynamics unknown, and does not account for metastable failures that do not manifest as overload. We present the first causal characterization of metastable failures by identifying their origin in metastable faults, i.e., structural destabilizing cycles of interaction among system components that, in isolation, are stabilizing. Metastable failures arise when scheduling decisions let these destabilizing interactions gain the upper hand over the individual components' stabilizing tendencies. We then derive a methodology to predict metastable failures and to build metastable-fault-tolerant (MFT) systems. We apply our methodology to three case studies, showcasing the generality of our results.
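To make "self-sustaining bad behavior even in the absence of adversarial conditions" concrete, the following is a minimal toy sketch of the overload-style case that the abstract attributes to prior work: a server queue coupled to client retries. It is not the paper's model or methodology and does not capture the scheduling analysis. The function `simulate`, its parameters, and all numeric values are hypothetical, chosen only for illustration.

```python
def simulate(retries_enabled, spike=True, ticks=200, capacity=100,
             base_load=80, spike_load=300, spike_start=20, spike_end=30,
             timeout_backlog=200):
    """Return the backlog after each tick of a toy single-server queue.

    All parameters are illustrative assumptions, not values from the paper.
    """
    backlog = 0
    history = []
    for t in range(ticks):
        in_spike = spike and spike_start <= t < spike_end
        arrivals = spike_load if in_spike else base_load
        # Assumed feedback loop: once the backlog is long enough for requests
        # to time out, each new request is retried once, adding base_load.
        if retries_enabled and backlog > timeout_backlog:
            arrivals += base_load
        # The server drains up to `capacity` requests per tick.
        backlog = max(0, backlog + arrivals - capacity)
        history.append(backlog)
    return history


no_trigger = simulate(retries_enabled=True, spike=False)
no_retries = simulate(retries_enabled=False)
with_retries = simulate(retries_enabled=True)

print("retries, no spike: final backlog", no_trigger[-1])
print("spike, no retries: final backlog", no_retries[-1])
print("spike and retries: final backlog", with_retries[-1])
```

In this sketch the base load stays below capacity throughout. Without the temporary spike the queue never builds up, and without retries the queue drains after the spike ends. Only when the spike and the retry loop are both present does the backlog keep growing after the trigger is gone.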