🤖 AI Summary
Existing log-based fault diagnosis methods treat anomaly detection and root cause localization as independent tasks, leading to error propagation, heavy reliance on costly monitoring data, and insufficient coordination between the two tasks. This paper proposes Chimera, a framework that establishes bidirectional interaction and knowledge transfer between the tasks at the data, feature, and diagnostic result levels, enabling end-to-end joint modeling of both. Through interactive multi-task learning, Chimera unifies log semantic representation with system-level causal structures, improving diagnostic consistency and robustness. Evaluated on two public datasets and one industrial dataset, Chimera outperforms existing methods by 2.92% to 5.00% in anomaly detection and by 19.01% to 37.09% in root cause localization. The framework has been deployed in production on an industrial cloud platform.
📝 Abstract
Log-based fault diagnosis is essential for maintaining software system availability. However, existing fault diagnosis methods are built in a task-independent manner, which fails to bridge the gap between anomaly detection and root cause localization in terms of data form and diagnostic objectives. This results in three major issues: 1) diagnostic bias accumulates in the system; 2) system deployment relies on expensive monitoring data; 3) the collaborative relationship between diagnostic tasks is overlooked. To address these problems, we propose Chimera, a novel end-to-end log-based fault diagnosis method whose key idea is bidirectional interaction and knowledge transfer between anomaly detection and root cause localization. Chimera is based on interactive multi-task learning, with carefully designed interaction strategies between anomaly detection and root cause localization at the data, feature, and diagnostic result levels, so that both sub-tasks are solved interactively within a unified end-to-end framework. Evaluation on two public datasets and one industrial dataset shows that Chimera outperforms existing methods in both anomaly detection and root cause localization, achieving improvements of 2.92% to 5.00% and 19.01% to 37.09%, respectively. It has been successfully deployed in production, serving an industrial cloud platform.