Automated Bug Generation in the era of Large Language Models

📅 2023-10-03
🏛️ arXiv.org
📈 Citations: 4
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the robustness evaluation of learning-based defect prediction and program repair models. It proposes a method for synthesizing defects that are both highly stealthy (hard to detect) and highly complex (hard to fix). The approach leverages large language models' (LLMs) attention mechanisms to guide multi-point code mutation, so that even though multiple locations change, the code representation stays close to that of the correct code. Combining LLM-driven mutation with large-scale mutant generation, the authors construct a dataset of over 435,000 synthetic defects. Experimental results show these defects increase the false-negative rate of mainstream defect predictors by 37% and raise the repair failure rate of state-of-the-art program repair models by 52%, significantly outperforming existing baselines. To the authors' knowledge, this is the first work to deeply integrate LLM attention analysis into defect synthesis, establishing a scalable and interpretable paradigm for model robustness assessment.
πŸ“ Abstract
Bugs are essential in software engineering; many research studies in the past decades have been proposed to detect, localize, and repair bugs in software systems. Effectiveness evaluation of such techniques requires complex bugs, i.e., those that are hard to detect through testing and hard to repair through debugging. From the classic software engineering point of view, a hard-to-repair bug differs from the correct code in multiple locations, making it hard to localize and repair. Hard-to-detect bugs, on the other hand, manifest themselves only under specific test inputs and reachability conditions. These two objectives, i.e., generating hard-to-detect and hard-to-repair bugs, are mostly aligned; a bug generation technique can change multiple statements that are covered only under a specific set of inputs. However, these two objectives conflict for learning-based techniques: a bug should have a code representation similar to the correct code in the training data to challenge a bug prediction model to distinguish them. The hard-to-repair bug definition remains the same, but with a caveat: the more a bug differs from the original code, the more distant their representations are, and the easier the bug is to detect. We propose BugFarm to transform arbitrary code into multiple complex bugs. BugFarm leverages LLMs to mutate code in multiple locations (hard to repair). To ensure that multiple modifications do not notably change the code representation, BugFarm analyzes the attention of the underlying model and instructs LLMs to change only the least attended locations (hard to detect). Our comprehensive evaluation of 435k+ bugs from over 1.9M mutants generated by BugFarm and two alternative approaches demonstrates our superiority in generating bugs that are hard to detect by learning-based bug prediction approaches and hard to repair by state-of-the-art learning-based program repair techniques.
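The core idea of attention-guided mutation can be sketched as follows. This is a toy illustration, not the paper's implementation: it assumes the attention weights have already been extracted from the underlying model (e.g., one tensor of shape layers × heads × seq × seq), then averages them to score how much attention each token position *receives*, and selects the least-attended positions as mutation candidates.

```python
import numpy as np

def least_attended_positions(attn, k=3):
    """Rank token positions by the attention they receive.

    attn: array of shape (layers, heads, seq, seq), where
    attn[l, h, i, j] is the weight token i places on token j.
    Returns the indices of the k positions receiving the least
    total attention -- candidates for stealthy mutation, since
    changing them perturbs the code representation the least.
    """
    # Average over layers, heads, and attending tokens to get a
    # single "attention received" score per position.
    received = attn.mean(axis=(0, 1, 2))   # shape: (seq,)
    return np.argsort(received)[:k].tolist()

# Toy example: 2 layers, 2 heads, 5 tokens, rows normalized
# so each token's outgoing attention sums to 1.
rng = np.random.default_rng(0)
attn = rng.random((2, 2, 5, 5))
attn /= attn.sum(axis=-1, keepdims=True)
targets = least_attended_positions(attn, k=2)
```

In practice the positions would be mapped back from subword tokens to source-code lines or statements before prompting the LLM; that bookkeeping is omitted here.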
Problem

Research questions and friction points this paper is trying to address.

Generating complex bugs that challenge both prediction and repair models
Ensuring bugs are hard to detect by learning-based bug prediction techniques
Creating hard-to-repair bugs that resist automated debugging methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLMs to mutate code in multiple locations
Analyzes model attention to change only the least-attended locations
Generates bugs that challenge both detection and repair models
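The second step above, instructing an LLM to mutate only the chosen locations, amounts to constrained prompt construction. A minimal sketch, assuming a hypothetical prompt format (BugFarm's actual prompts may differ):

```python
def build_mutation_prompt(code_lines, target_lines, n_bugs=3):
    """Assemble an instruction asking an LLM to inject bugs only
    at the given (least-attended) line numbers.

    Hypothetical prompt format for illustration; the restriction
    to target_lines is what keeps the mutants representationally
    close to the original code.
    """
    # Number the source lines so the targets are unambiguous.
    marked = "\n".join(
        f"{i}: {line}" for i, line in enumerate(code_lines, start=1)
    )
    targets = ", ".join(str(t) for t in sorted(target_lines))
    return (
        f"Given the code below, produce {n_bugs} buggy variants.\n"
        f"Only modify lines {targets}; leave all other lines intact.\n"
        f"Each change must keep the code compiling.\n\n{marked}"
    )

prompt = build_mutation_prompt(
    ["def add(a, b):", "    return a + b"], target_lines=[2]
)
```

Pinning the mutation to explicit line numbers, rather than letting the model edit freely, is the design choice that makes the resulting bugs hard to detect while still allowing multi-location (hard-to-repair) changes when several low-attention lines are supplied.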