🤖 AI Summary
To address the challenges of modeling heterogeneous fault knowledge and capturing complex long-range dependencies in system logs—leading to performance bottlenecks in fault detection—this paper proposes a sustainable learning Mixture-of-Experts (MoE) framework tailored for intelligent fault-tolerant computing. Methodologically, it introduces a novel decoupled Transformer architecture to extract prototypical fault representations, integrated with dual-path expert networks and a two-stage online optimization mechanism that enables both offline pre-training and runtime continual self-adaptation. The framework jointly optimizes fault detection and classification. Evaluated on a fault-tolerant computing benchmark, it achieves a 12.3% improvement in F1-score and a 9.7% gain in classification accuracy over state-of-the-art methods. Notably, it is the first approach to realize lifelong model evolution, establishing a scalable and evolvable paradigm for high-reliability log analysis in dynamic service environments.
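Since the paper's code is not yet released, the sketch below only illustrates one plausible reading of the dual-path design in PyTorch: a shared transformer over log tokens pools into a fault prototype vector, which feeds two top-k-gated expert heads, one for detection and one for classification. Every module name, dimension, and the top-2 gating rule is an illustrative assumption, not the paper's implementation; a causal attention mask stands in for the decoder-style transformer the summary names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEHead(nn.Module):
    """Top-k gated mixture of expert MLPs over a shared representation."""

    def __init__(self, d_model: int, n_experts: int, n_classes: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # router over experts
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_model), nn.GELU(),
                nn.Linear(d_model, n_classes),
            )
            for _ in range(n_experts)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (B, d_model)
        scores = self.gate(h)                             # (B, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)             # renormalize over top-k
        all_out = torch.stack([e(h) for e in self.experts], dim=1)  # (B, E, C)
        idx = top_idx.unsqueeze(-1).expand(-1, -1, all_out.size(-1))
        picked = all_out.gather(1, idx)                   # (B, top_k, C)
        return (weights.unsqueeze(-1) * picked).sum(dim=1)


class FTMoESketch(nn.Module):
    """Shared log encoder feeding two task-specific MoE heads."""

    def __init__(self, vocab_size: int, d_model: int = 128,
                 n_fault_types: int = 10, n_experts: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.detect_head = MoEHead(d_model, n_experts, n_classes=2)
        self.classify_head = MoEHead(d_model, n_experts, n_classes=n_fault_types)

    def forward(self, log_tokens: torch.Tensor):
        # Causal mask approximates the decoder-style attention the paper names.
        seq_len = log_tokens.size(1)
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=log_tokens.device),
            diagonal=1,
        )
        h = self.encoder(self.embed(log_tokens), mask=mask).mean(dim=1)  # prototype
        return self.detect_head(h), self.classify_head(h)


model = FTMoESketch(vocab_size=1000)
detect_logits, class_logits = model(torch.randint(0, 1000, (8, 64)))
```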
📝 Abstract
Intelligent fault-tolerant (FT) computing has recently demonstrated significant advantages by predicting and diagnosing faults in advance, enabling reliable service delivery. However, due to the heterogeneity of fault knowledge and the complex dependencies in time-series log data, existing deep learning-based FT algorithms struggle to further improve detection performance with a single neural network model. To this end, we propose FT-MoE, a sustainable-learning mixture-of-experts model for multi-task fault-tolerant computing, in which different parameters learn distinct fault knowledge to achieve high reliability for service systems. First, we use decoder-based transformer models to obtain fault prototype vectors that decouple long-distance dependencies. Second, we present dual mixture-of-experts networks for accurate prediction on both the fault detection and classification tasks. Third, we design a two-stage optimization scheme of offline training and online tuning, which allows FT-MoE to keep learning during operation and adapt to dynamic service environments. Finally, to verify the effectiveness of FT-MoE, we conduct extensive experiments on the FT benchmark. The results show that FT-MoE outperforms state-of-the-art methods. Code will be available upon publication.
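As a companion to the sketch above, the following is a minimal, assumed wiring of the two-stage scheme the abstract outlines: offline joint training on the detection and classification losses, followed by low-learning-rate online tuning on a runtime stream. The equal loss weighting, learning rates, and the source of labels for the online stream are all assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F


def offline_train(model, loader, epochs: int = 10, lr: float = 1e-3):
    """Stage 1: jointly pre-train both task heads on labeled log batches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for logs, is_fault, fault_type in loader:
            det_logits, cls_logits = model(logs)
            # Joint objective: detection loss + classification loss.
            loss = (F.cross_entropy(det_logits, is_fault)
                    + F.cross_entropy(cls_logits, fault_type))
            opt.zero_grad()
            loss.backward()
            opt.step()


def online_tune(model, stream, lr: float = 1e-5):
    """Stage 2: keep adapting in operation with much smaller steps,
    so runtime updates do not overwrite the offline knowledge."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for logs, is_fault, fault_type in stream:  # labels assumed from feedback
        det_logits, cls_logits = model(logs)
        loss = (F.cross_entropy(det_logits, is_fault)
                + F.cross_entropy(cls_logits, fault_type))
        opt.zero_grad()
        loss.backward()
        opt.step()
```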