Classifying Issues in Open-source GitHub Repositories

📅 2025-07-25
🤖 AI Summary
To address the challenge of efficiently categorizing large volumes of unlabeled issues in GitHub open-source repositories, this paper proposes a deep neural network (DNN)-based multi-label automatic classification framework. The method jointly encodes issue titles, descriptions, and contextual text via end-to-end learning to extract semantic features and predict multiple labels, eliminating reliance on handcrafted rules or predefined feature engineering. Extensive experiments across major open-source projects—including Kubernetes and Apache Spark—demonstrate that the proposed model achieves an average 12.3% improvement in macro-F1 score over traditional machine learning baselines (e.g., SVM, XGBoost) and state-of-the-art deep learning approaches. To our knowledge, this is the first work to adapt a lightweight DNN architecture specifically for cross-project, multi-label issue classification. The framework delivers high accuracy, strong generalizability across diverse projects, and practical deployability, thereby providing a scalable technical foundation for issue governance in open-source collaboration.
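The summary reports results in terms of macro-F1. As a reference for how that metric behaves in a multi-label setting, the following is a minimal sketch in plain Python. The function name `macro_f1`, the toy data, and the convention of scoring an unused label as 0 are assumptions made for illustration; the paper's own evaluation code is not shown in this page.

```python
from typing import List, Set


def macro_f1(y_true: List[Set[str]], y_pred: List[Set[str]], labels: List[str]) -> float:
    """Unweighted mean of per-label F1 scores for multi-label predictions."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label that never appears in truth or prediction scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical data): three issues, three candidate labels.
labels = ["Bug", "Documentation", "Question"]
y_true = [{"Bug"}, {"Documentation", "Question"}, {"Bug", "Question"}]
y_pred = [{"Bug"}, {"Documentation"}, {"Question"}]
score = macro_f1(y_true, y_pred, labels)
print(round(score, 4))
```

Because every label contributes equally to the average, macro-F1 penalizes a model that does well on frequent labels such as Bug but poorly on rare ones.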

📝 Abstract
GitHub is the most widely used platform for software maintenance in the open-source community. Developers report issues on GitHub when they run into difficulties. Labels on those issues tell developers in advance what an issue concerns, which makes it easier to address. However, most GitHub repositories do not label their issues consistently: among the thousands of open-source repositories on GitHub, some label their issues properly and others do not. The goal of this work is to classify issues in the open-source community using ML and DNN models. When issues are pre-labeled, the team can start on the problem and assign the right people immediately, which speeds up development. In this work, we analyzed prominent GitHub open-source repositories and classified their issues into common labels: API, Documentation, Enhancement, Question, Easy, Help-wanted, Dependency, CI, Waiting for OP's response, Test, Bug, etc. Our study shows that DNN models outperf
Problem

Research questions and friction points this paper is trying to address.

Classify GitHub issues using ML and DNN models
Automate labeling of issues to aid developers
Improve issue resolution speed with pre-labeled categories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Classify GitHub issues using ML models
Apply DNN for better classification performance
Analyze prominent repositories for common labels
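
To make the multi-label framing concrete, here is a minimal sketch of how one issue could be turned into a model input and a multi-hot target, using the labels named in the abstract. The helper `build_example`, the title/body concatenation, and the sample issue are hypothetical; the abstract's label list ends in "etc.", so `LABELS` is only the subset given there, and the authors' actual preprocessing may differ.

```python
from typing import List, Tuple

# Labels named in the abstract (the original list is open-ended).
LABELS = [
    "API", "Documentation", "Enhancement", "Question", "Easy", "Help-wanted",
    "Dependency", "CI", "Waiting for OP's response", "Test", "Bug",
]


def build_example(title: str, body: str, issue_labels: List[str]) -> Tuple[str, List[int]]:
    """Join title and body into one text and encode labels as a 0/1 vector."""
    text = f"{title.strip()}\n\n{body.strip()}"
    present = set(issue_labels)
    # One position per label; an issue may set several positions to 1.
    target = [1 if label in present else 0 for label in LABELS]
    return text, target


# Hypothetical issue carrying two labels at once.
text, target = build_example("Crash on start", "Fails in the CI pipeline.", ["Bug", "CI"])
print(target)
```

A classifier trained on such targets predicts each label independently, which is what allows a single issue to receive both Bug and CI.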
Amir Hossain Raaj
Department of Computer Science, George Mason University
Fairuz Nawer Meem
Department of Computer Science, George Mason University
Sadia Afrin Mim
George Mason University
Software Engineering, Machine Learning Fairness, Quantum Computing