🤖 AI Summary
This work addresses the exponential growth of the combinatorial failure space in microservice architectures, where traditional random fault injection is inefficient and neither pinpoints critical failure modes nor offers actionable hardening guidance. The authors propose FastFI, a framework that combines a customized DFS-based SAT solver, which exploits the monotonicity and low-overlap properties of the derived CNF formulas, with dynamic fault injection and microservice call-chain analysis. FastFI enumerates all valid combinatorial faults and recommends hardening measures based on an API criticality assessment. Evaluated on four microservice benchmarks, FastFI reduces end-to-end fault-injection time by 76.12% on average compared to state-of-the-art baselines, accurately identifies high-impact APIs, and incurs acceptable resource overhead.
📝 Abstract
Fault injection is a key technique for assessing software reliability, enabling proactive detection of system defects before they manifest in production. However, the increasing complexity of microservice architectures leads to exponential growth in the fault-injection space, rendering traditional random injection inefficient. Recent lineage-driven approaches mitigate this problem through heuristic pruning, but they face two limitations. First, combinatorial-fault discovery remains bottlenecked by general-purpose SAT solvers, which fail to exploit the monotone and low-overlap structure of the derived CNF formulas and typically rely on a static upper bound on fault size. Second, existing techniques provide limited post-injection guidance beyond reporting detected faults. To address these challenges, we propose FastFI, a fault-injection-guided framework that enhances the robustness of API call sites in microservice-based systems. FastFI features a DFS-based solver with dynamic fault injection to discover all valid combinatorial faults, and it leverages fault-injection results to identify critical APIs whose call sites should be hardened for robustness. Experiments on four representative microservice benchmarks show that FastFI reduces end-to-end fault-injection time by an average of 76.12% compared to state-of-the-art baselines while maintaining acceptable resource overhead. Moreover, FastFI accurately identifies high-impact APIs and provides actionable guidance for call-site hardening.
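
To make the "monotone CNF" observation concrete, the sketch below shows one simple way a depth-first search can enumerate fault combinations without a general-purpose SAT solver or a preset bound on fault size. It is an illustration only, not FastFI's actual solver: the function name `minimal_fault_sets`, the clause encoding (each clause lists the faults that would each break one supporting call path), and the service names in the example are all assumptions made for this sketch. Because every literal is positive, a candidate combination is valid exactly when it contains at least one fault from every clause, so the search only ever needs to add faults, never retract them.

```python
from typing import FrozenSet, Iterable, List, Set


def minimal_fault_sets(clauses: Iterable[Iterable[str]]) -> List[FrozenSet[str]]:
    """Enumerate the minimal fault combinations satisfying a monotone CNF.

    Hypothetical encoding: each clause is a collection of fault names, and a
    combination is valid when it includes at least one fault from every clause.
    """
    clause_sets = [frozenset(c) for c in clauses]
    found: Set[FrozenSet[str]] = set()

    def dfs(chosen: FrozenSet[str]) -> None:
        # Pick the first clause that the current combination does not yet hit.
        pending = next((c for c in clause_sets if not (c & chosen)), None)
        if pending is None:
            # Every clause is hit, so this combination is valid.
            found.add(chosen)
            return
        # Branch on each fault that could satisfy the pending clause.
        for fault in sorted(pending):
            dfs(chosen | {fault})

    dfs(frozenset())

    # Drop any combination that strictly contains another valid combination.
    minimal = [s for s in found if not any(other < s for other in found)]
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))


if __name__ == "__main__":
    # Two illustrative clauses over hypothetical services A, B, and C.
    result = minimal_fault_sets([["A", "B"], ["B", "C"]])
    print([sorted(s) for s in result])  # [['B'], ['A', 'C']]
```

The search depth is bounded by the number of clauses rather than by a fixed fault-size limit, which is the property the abstract contrasts with solvers that rely on a static upper bound. A practical solver would also prune non-minimal branches during the search and interleave real injections to validate candidates; this sketch does neither.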