Predicting Intermittent Job Failure Categories for Diagnosis Using Few-Shot Fine-Tuned Language Models

📅 2026-01-29

🤖 AI Summary

Intermittent failures in continuous integration (CI) pipelines are difficult to diagnose, wasting compute on reruns and developer time on investigation. This work proposes FlaXifyer, a few-shot learning approach that fine-tunes pre-trained language models on job execution logs using only 12 labeled examples per failure category, and LogSift, an interpretability technique that identifies the log statements most influential to a prediction in under one second. Evaluated on 2,458 real-world job failures from TELUS, FlaXifyer achieves 84.3% Macro F1 and 92.0% Top-2 accuracy, while LogSift reduces log review effort by 74.4% and surfaces relevant failure information in 87% of cases. Together, the two techniques show that failure diagnosis can be accelerated with minimal labeled data.
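
To make the "12 labeled examples per category" setting concrete, the following is a minimal sketch of how a few-shot training split could be drawn from labeled job logs. It is not the authors' sampling procedure, which is not described here. The function name `sample_few_shot`, the `(log_text, label)` tuple layout, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_few_shot(examples, shots_per_class=12, seed=0):
    """Split (log_text, label) pairs into a few-shot training set and a held-out set.

    Hypothetical helper: draws up to `shots_per_class` examples per label
    for training and keeps the remainder for evaluation.
    """
    # Group examples by their failure category.
    by_label = defaultdict(list)
    for log_text, label in examples:
        by_label[label].append((log_text, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, held_out = [], []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        train.extend(items[:shots_per_class])
        held_out.extend(items[shots_per_class:])
    return train, held_out
```

With this layout, a category holding fewer than 12 labeled logs contributes all of them to training and none to the held-out set, so such categories would need separate handling before evaluation.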

📝 Abstract

In principle, Continuous Integration (CI) pipeline failures provide valuable feedback to developers on code-related errors. In practice, however, pipeline jobs often fail intermittently due to non-deterministic tests, network outages, infrastructure failures, resource exhaustion, and other reliability issues. These intermittent (flaky) job failures lead to substantial inefficiencies: wasted computational resources from repeated reruns and significant diagnosis time that distracts developers from core activities and often requires intervention from specialized teams. Prior work has proposed machine learning techniques to detect intermittent failures, but does not address the subsequent diagnosis challenge. To fill this gap, we introduce FlaXifyer, a few-shot learning approach for predicting intermittent job failure categories using pre-trained language models. FlaXifyer requires only job execution logs and achieves 84.3% Macro F1 and 92.0% Top-2 accuracy with just 12 labeled examples per category. We also propose LogSift, an interpretability technique that identifies influential log statements in under one second, reducing review effort by 74.4% while surfacing relevant failure information in 87% of cases. Evaluation on 2,458 job failures from TELUS demonstrates that FlaXifyer and LogSift enable effective automated triage, accelerate failure diagnosis, and pave the way towards the automated resolution of intermittent job failures.
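
The two headline metrics can be illustrated with a short, dependency-free sketch. Macro F1 averages the per-category F1 scores with equal weight, so rare failure categories count as much as common ones; Top-2 accuracy counts a prediction as correct when the true category is among the two highest-scoring ones. The function names and the category names in the usage example (`network`, `test`, `infra`) are hypothetical and are not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def top_k_accuracy(y_true, scores, k=2):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    `scores` is a list of dicts mapping each class name to a model score.
    """
    hits = 0
    for truth, class_scores in zip(y_true, scores):
        top_k = sorted(class_scores, key=class_scores.get, reverse=True)[:k]
        hits += truth in top_k
    return hits / len(y_true)


# Usage with made-up labels and scores (illustrative only).
y_true = ["network", "test", "infra", "test"]
y_pred = ["network", "test", "test", "test"]
scores = [
    {"network": 0.7, "test": 0.2, "infra": 0.1},
    {"network": 0.1, "test": 0.8, "infra": 0.1},
    {"network": 0.1, "test": 0.5, "infra": 0.4},
    {"network": 0.5, "test": 0.1, "infra": 0.4},
]
print(macro_f1(y_true, y_pred))
print(top_k_accuracy(y_true, scores, k=2))
```

In a triage setting, Top-2 accuracy reflects the case where an engineer is shown the two most likely categories rather than a single guess.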

Problem

Research questions and friction points this paper is trying to address.

intermittent job failure

failure diagnosis

continuous integration

flaky tests

automated triage

Innovation

Methods, ideas, or system contributions that make the work stand out.

few-shot learning

language models

intermittent failure diagnosis