Principal Context-aware Diffusion Guided Data Augmentation for Fault Localization

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe class imbalance in fault localization (FL) caused by the scarcity of failing test cases, this paper proposes a principal-context-guided conditional diffusion generation method. First, program slicing is integrated with principal component analysis (PCA) to extract fault-sensitive execution contexts. Second, a statistical program dependency model is constructed and embedded as a conditional constraint in the diffusion process, enabling the synthesis of semantically consistent, fault-relevant failing test cases. This work is the first to combine statistical program dependency modeling with diffusion-based generation, and it introduces the "principal context" mechanism to guide data augmentation. Evaluated on six state-of-the-art FL techniques, the approach achieves average improvements of 383.83%, 227.08%, and 224.19% in Top-1, Top-3, and Top-5 fault localization accuracy, respectively, significantly outperforming all baseline methods.

📝 Abstract
Test cases are indispensable for conducting effective fault localization (FL). However, test cases in practice are severely class imbalanced: the number of failing test cases (the minority class) is much smaller than the number of passing ones (the majority class). This severe class imbalance between failing and passing test cases has hindered FL effectiveness. To address this issue, we propose PCD-DAug: a Principal Context-aware Diffusion guided Data Augmentation approach that generates synthesized failing test cases to improve FL. PCD-DAug first combines program slicing with principal component analysis to construct a principal context that shows how a set of statements influences the faulty output via statistical program dependencies. Then, PCD-DAug devises a conditional diffusion model that learns from principal contexts to generate synthesized failing test cases and obtain a class-balanced dataset for FL. We conducted large-scale experiments on six state-of-the-art FL approaches and compared PCD-DAug with six data augmentation baselines. The results show that PCD-DAug significantly improves FL effectiveness, e.g. achieving average improvements of 383.83%, 227.08%, and 224.19% across the six FL approaches under the Top-1, Top-3, and Top-5 metrics, respectively.
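The Top-1, Top-3, and Top-5 figures above are easier to read with the metric written out. The sketch below assumes the common definition of Top-N (the number of faulty versions whose faulty statement appears within the first N ranked positions) and a plain relative-improvement formula. The rankings, version names, and statement names are hypothetical, and the page does not say how the paper averages improvements across FL approaches, so this is only an illustration of the arithmetic.

```python
from typing import Dict, List


def top_n(rankings: Dict[str, List[str]], faulty: Dict[str, str], n: int) -> int:
    """Count versions whose true faulty statement is ranked within the first n positions."""
    hits = 0
    for version, ranked in rankings.items():
        if faulty[version] in ranked[:n]:
            hits += 1
    return hits


def relative_improvement(before: int, after: int) -> float:
    """Percentage improvement of `after` over `before`; `before` must be positive."""
    if before <= 0:
        raise ValueError("relative improvement is undefined when the baseline is 0")
    return (after - before) / before * 100.0


# Hypothetical suspiciousness rankings for three faulty program versions,
# produced by the same FL technique without and with augmented failing tests.
rankings_original = {
    "v1": ["s4", "s9", "s2", "s7", "s1"],
    "v2": ["s3", "s8", "s5", "s6", "s2"],
    "v3": ["s1", "s2", "s3", "s4", "s5"],
}
rankings_augmented = {
    "v1": ["s2", "s4", "s9", "s7", "s1"],
    "v2": ["s5", "s3", "s8", "s6", "s2"],
    "v3": ["s1", "s2", "s3", "s4", "s5"],
}
faulty = {"v1": "s2", "v2": "s5", "v3": "s1"}  # ground-truth faulty statement per version

top1_before = top_n(rankings_original, faulty, 1)
top1_after = top_n(rankings_augmented, faulty, 1)
top1_gain = relative_improvement(top1_before, top1_after)
```

In this toy data, Top-1 rises from 1 to 3 localized faults, a 200% relative improvement. Percentages above 100% therefore mean that the count of localized faults more than doubled.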
Problem

Research questions and friction points this paper is trying to address.

Addresses class imbalance in test cases for fault localization
Generates synthesized failing test cases using diffusion models
Improves fault localization effectiveness with balanced datasets
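To make the imbalance concrete, the following minimal sketch counts passing and failing tests and works out how many synthesized failing tests would be needed to reach a balanced dataset. The label encoding (1 = failing, 0 = passing), the function names, and the 18-to-2 split are assumptions for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def count_classes(labels: List[int]) -> Tuple[int, int]:
    """Return (passing, failing) counts; labels use 1 for failing and 0 for passing."""
    failing = sum(1 for y in labels if y == 1)
    passing = len(labels) - failing
    return passing, failing


def synthetic_failing_needed(labels: List[int]) -> int:
    """Number of synthesized failing tests required to match the passing count."""
    passing, failing = count_classes(labels)
    return max(passing - failing, 0)


# Hypothetical test suite: 18 passing tests and only 2 failing tests.
labels = [0] * 18 + [1] * 2

passing, failing = count_classes(labels)
imbalance_ratio = passing / failing          # passing tests per failing test
needed = synthetic_failing_needed(labels)    # failing tests to synthesize
balanced_labels = labels + [1] * needed      # label vector after augmentation
```

Here the generator would have to supply 16 synthetic failing tests, so most of the failing class in the balanced set is generated data. That is why the quality of the synthesized cases matters for the downstream FL technique.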
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines program slicing with PCA
Uses conditional diffusion model
Generates synthesized failing test cases
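As a rough illustration of the diffusion side, the sketch below implements only the standard DDPM-style forward (noising) step, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, applied to a small vector. The linear noise schedule, the 1000 steps, and the four-element coverage vector over a principal context are assumptions; the page does not describe PCD-DAug's actual schedule, network, or conditioning, and the learned conditional denoiser is omitted entirely.

```python
import math
import random
from typing import List


def linear_beta_schedule(steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Linearly spaced noise variances beta_1..beta_T (an assumed, commonly used schedule)."""
    if steps == 1:
        return [beta_start]
    delta = (beta_end - beta_start) / (steps - 1)
    return [beta_start + i * delta for i in range(steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def forward_noise(x0: List[float], alpha_bar_t: float, noise: List[float]) -> List[float]:
    """Closed-form forward step: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# Hypothetical failing-test vector: which principal-context statements were executed.
x0 = [1.0, 0.0, 1.0, 1.0]
noise = [rng.gauss(0.0, 1.0) for _ in x0]
x_mid = forward_noise(x0, alpha_bars[499], noise)  # partially noised sample at step 500
```

In a conditional diffusion model, a denoising network would be trained to reverse this process while also receiving the conditioning signal, so that sampling from pure noise yields new vectors resembling failing tests. That reverse model is the part this sketch leaves out.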
Shihao Fu
School of Big Data and Software Engineering, Chongqing University
Yan Lei
Chongqing University
Software Engineering, Fault Localization, Program Repair