United We Stand: Towards End-to-End Log-based Fault Diagnosis via Interactive Multi-Task Learning

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing log-based fault diagnosis methods treat anomaly detection and root cause localization as independent tasks, which leads to error propagation, heavy reliance on costly monitoring data, and insufficient inter-task coordination. This paper proposes Chimera, an end-to-end framework that establishes bidirectional interaction and knowledge transfer between the two tasks at the data, feature, and diagnostic result levels, so that both are modeled jointly. Evaluated on two public datasets and one industrial dataset, Chimera reports improvements of 2.92% to 5.00% in anomaly detection and 19.01% to 37.09% in root cause localization over existing methods. The framework has been deployed in production, serving an industrial cloud platform.

📝 Abstract
Log-based fault diagnosis is essential for maintaining software system availability. However, existing fault diagnosis methods are built in a task-independent manner, which fails to bridge the gap between anomaly detection and root cause localization in terms of data form and diagnostic objectives. This results in three major issues: 1) diagnostic bias accumulates in the system; 2) system deployment relies on expensive monitoring data; 3) the collaborative relationship between diagnostic tasks is overlooked. To address these problems, we propose a novel end-to-end log-based fault diagnosis method, Chimera, whose key idea is to achieve end-to-end fault diagnosis through bidirectional interaction and knowledge transfer between anomaly detection and root cause localization. Chimera is based on interactive multi-task learning, with carefully designed interaction strategies between anomaly detection and root cause localization at the data, feature, and diagnostic result levels, so that both sub-tasks are performed interactively within a unified end-to-end framework. Evaluation on two public datasets and one industrial dataset shows that Chimera outperforms existing methods in both anomaly detection and root cause localization, achieving improvements of 2.92% to 5.00% and 19.01% to 37.09%, respectively. It has been successfully deployed in production, serving an industrial cloud platform.
Problem

Research questions and friction points this paper is trying to address.

Bridging data and objective gaps between anomaly detection and root cause localization
Reducing diagnostic bias and expensive monitoring dependencies in fault diagnosis
Establishing collaborative relationships between diagnostic tasks through multi-task learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses interactive multi-task learning for fault diagnosis
Integrates anomaly detection and root cause localization
Implements multi-level interaction strategies in unified framework
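
To make the idea of joint modeling more concrete, the following is a minimal toy sketch of how two diagnostic heads might share features and be trained with one combined loss. It is not Chimera's architecture, which this summary does not specify. The function names (`diagnose`, `joint_loss`), the linear heads, the gating of root-cause scores by the anomaly probability, and the weighting parameter `alpha` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs: Sequence[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def diagnose(
    features: Sequence[float],
    w_detect: Sequence[float],
    w_locate: Sequence[Sequence[float]],
) -> Tuple[float, List[float], List[float]]:
    """Hypothetical two-head model over one shared feature vector.

    Returns the anomaly probability, the root-cause distribution over
    candidate components, and the root-cause scores gated by the anomaly
    probability (an assumed form of decision-level interaction).
    """
    p_anomaly = sigmoid(dot(w_detect, features))
    cause_probs = softmax([dot(w, features) for w in w_locate])
    gated = [p_anomaly * p for p in cause_probs]
    return p_anomaly, cause_probs, gated


def joint_loss(
    p_anomaly: float,
    cause_probs: Sequence[float],
    is_anomaly: int,
    cause_index: int,
    alpha: float = 0.5,
) -> float:
    """Weighted sum of a detection loss and a localization loss.

    The localization term is only counted for anomalous samples, since a
    normal sample has no root cause. `alpha` is an assumed trade-off weight.
    """
    eps = 1e-12
    detection = -(
        is_anomaly * math.log(p_anomaly + eps)
        + (1 - is_anomaly) * math.log(1.0 - p_anomaly + eps)
    )
    localization = -math.log(cause_probs[cause_index] + eps) if is_anomaly else 0.0
    return alpha * detection + (1.0 - alpha) * localization


if __name__ == "__main__":
    # Made-up feature vector and weights, purely for demonstration.
    features = [1.0, 0.0, 2.0]
    w_detect = [0.5, 0.0, 0.5]
    w_locate = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
    p, probs, gated = diagnose(features, w_detect, w_locate)
    print("anomaly probability:", p)
    print("gated root-cause scores:", gated)
    print("joint loss:", joint_loss(p, probs, is_anomaly=1, cause_index=1))
```

Because both heads read the same feature vector and contribute to one loss, a gradient-based trainer would update the shared representation using signals from both tasks. That is the general motivation behind multi-task learning, although the specific interaction strategies used in the paper may differ substantially from this sketch.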