🤖 AI Summary
Current research on automatic classification of software issue reports suffers from three critical gaps: insufficient practitioner involvement, overreliance on prediction accuracy as the sole evaluation criterion while neglecting industrially relevant dimensions such as explainability, scalability, and generalizability, and exclusive reliance on archival data from open-source repositories. This study systematically maps 46 relevant works, establishing the first comprehensive systematic mapping study in this domain. It reveals a pronounced disconnect between the prevalent techniques, including traditional machine learning (e.g., SVM, Naïve Bayes), deep learning (e.g., CNN, LSTM), and large language models, and the requirements of real-world industrial deployment. Key contributions include identifying three fundamental challenges: lack of industrial validation, absence of multi-dimensional evaluation frameworks, and an overly homogeneous data ecosystem. The paper proposes a forward-looking research agenda guided by practice-driven design, multi-faceted evaluation criteria, and collaborative, community-based data curation, thereby providing a structured overview and strategic roadmap for future work.
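To make the surveyed technique families concrete, here is a minimal sketch of the kind of traditional machine learning pipeline the mapped studies apply: TF-IDF features feeding a linear SVM that labels issue reports by type. This is an illustration, not a method from any specific mapped study; the example reports, labels, and parameters are hypothetical.

```python
# Sketch of a traditional ML issue-report classifier (TF-IDF + linear SVM).
# The training reports and labels below are hypothetical toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reports = [
    "App crashes with NullPointerException when saving a file",
    "Please add dark mode support to the settings page",
    "How do I configure the proxy for the CLI?",
    "Memory leak after closing the editor tab",
]
labels = ["bug", "enhancement", "question", "bug"]

# TF-IDF turns free-text reports into sparse term-weight vectors;
# LinearSVC then learns a linear decision boundary per issue type.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
classifier.fit(reports, labels)

# Likely ['bug'] for this toy data, since "crash" co-occurs with bug labels.
print(classifier.predict(["Crash on startup after the latest update"]))
```

In practice, the studies train such pipelines on thousands of labeled issues mined from trackers like GitHub or Jira and report accuracy-style metrics, which is precisely the evaluation pattern the paper critiques.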
📝 Abstract
Several studies have evaluated automatic techniques for classifying software issue reports so that practitioners can effectively assign relevant resources based on the type of issue. However, no comprehensive overview of this area has yet been published; such an overview would help identify future research directions and provide an extensive collection of potentially relevant existing solutions. This study aims to provide such an overview of the use of automatic techniques to classify issue reports. We conducted a systematic mapping study and identified 46 studies on the topic. The results indicate that the existing literature applies a variety of techniques for classifying issue reports, ranging from traditional machine learning and deep learning-based techniques to more advanced large language models. Furthermore, we observe that these studies (a) lack the involvement of practitioners, (b) do not consider potentially relevant adoption factors beyond prediction accuracy, such as the explainability, scalability, and generalizability of the techniques, and (c) rely almost exclusively on archival data from open-source repositories. Future research should therefore focus on real industrial evaluations, consider these additional adoption factors, and actively involve practitioners.
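For contrast with the supervised pipeline above, a pretrained language model can label issues without task-specific training. The sketch below uses Hugging Face's zero-shot classification pipeline; the model choice and candidate label set are assumptions for illustration and are not prescribed by the paper.

```python
# Sketch of zero-shot issue-type classification with a pretrained model.
# Model and labels are illustrative assumptions, not the paper's setup.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",  # a commonly used zero-shot NLI model
)

report = "The login button does nothing when clicked on Firefox"
result = classifier(report, candidate_labels=["bug", "enhancement", "question"])

# `result` holds the candidate labels ranked by score; the top-ranked
# label (most likely "bug" here) is the predicted issue type.
print(result["labels"][0], result["scores"][0])
```

Whether such predictions are accurate enough is only one of the adoption questions the paper raises; explainability, scalability, and validation on industrial (not only open-source) data remain open.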