🤖 AI Summary
This work addresses the critical lack of real-world, production-grade defect data for AI-driven automated testing. To this end, we introduce PyResBugs, the first natural-language-annotated dataset targeting residual bugs in Python: defects that evade conventional testing and manifest only in production environments. Methodologically, we systematically collect defect pairs (faulty/patched versions) from mainstream Python frameworks, applying rigorous human annotation, version alignment, and multi-level validation. Each defect is accompanied by fine-grained natural language descriptions covering its root cause, triggering conditions, and observable exception behavior. Our core contribution is the first precise mapping from natural language specifications to executable faults, bridging the long-standing gap between NL-driven fault injection and production-representative defects. Empirical evaluation demonstrates that PyResBugs significantly improves the generalizability and practical utility of AI-based testing tools in both real-defect detection and controllable fault injection tasks.
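To make the described pairing concrete, the sketch below models one dataset entry as a Python dataclass. This is an illustrative schema only: the field names, the example bug, and the `ResidualBugEntry` type are assumptions for exposition, not the actual PyResBugs format.

```python
from dataclasses import dataclass

# Hypothetical sketch of a single PyResBugs-style entry; field names
# and example content are assumptions, not the dataset's real schema.
@dataclass
class ResidualBugEntry:
    project: str             # source framework the pair was mined from
    faulty_code: str         # buggy (pre-fix) version of the snippet
    patched_code: str        # corresponding fault-free (fixed) version
    root_cause: str          # NL description: why the bug occurs
    trigger_condition: str   # NL description: when it manifests
    exception_behavior: str  # NL description: the observable failure

entry = ResidualBugEntry(
    project="example-framework",
    faulty_code="def lookup(d, k):\n    return d[k]\n",
    patched_code="def lookup(d, k):\n    return d.get(k)\n",
    root_cause="Unguarded dictionary access assumes the key exists.",
    trigger_condition="A caller passes a key absent from the mapping.",
    exception_behavior="KeyError propagates to the caller at runtime.",
)
```

Keeping the faulty and patched versions side by side with the three NL levels is what lets a tool move in either direction: from code to description, or from description to an executable fault.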
📄 Abstract
This paper presents PyResBugs, a curated dataset of residual bugs, i.e., defects that persist undetected during traditional testing but later surface in production, collected from major Python frameworks. Each bug in the dataset is paired with its corresponding fault-free (fixed) version and annotated with multi-level natural language (NL) descriptions. These NL descriptions enable natural language-driven fault injection, offering a novel approach to simulating real-world faults in software systems. By bridging the gap between software fault injection techniques and real-world representativeness, PyResBugs provides researchers with a high-quality resource for advancing AI-driven automated testing in Python systems.
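One way such faulty/fixed pairs can drive fault injection is by reversing the fix: given a source file containing the patched snippet, swap it for the faulty counterpart to reintroduce the defect. The helper below is a minimal sketch of that idea; `inject_fault` and the example snippets are illustrative, not part of PyResBugs itself.

```python
# Hypothetical sketch of fault injection from a faulty/patched pair:
# reintroduce the residual bug by replacing the fixed snippet with its
# buggy version. The function name and snippets are illustrative.
def inject_fault(source: str, patched_code: str, faulty_code: str) -> str:
    """Swap the fixed snippet for its buggy counterpart in the source text."""
    if patched_code not in source:
        raise ValueError("patched snippet not found in source")
    return source.replace(patched_code, faulty_code, 1)

fixed_module = "def lookup(d, k):\n    return d.get(k)\n"
buggy_module = inject_fault(
    fixed_module,
    patched_code="return d.get(k)",
    faulty_code="return d[k]",
)
# buggy_module now contains the unguarded access that raises KeyError
# when the key is missing, matching the entry's NL exception description.
```

The NL annotations then serve as the selection interface: a tester can ask for, say, "bugs triggered by missing keys that raise KeyError" and obtain the matching faulty variants to inject.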