Design, Implementation and Evaluation of a Novel Programming Language Topic Classification Workflow

📅 2025-09-24
🤖 AI Summary
Identifying fine-grained distributions of programming language concepts—such as operator overloading, virtual functions, inheritance, and templates—in large-scale software systems remains challenging due to their sparse, context-dependent manifestations in source code. To address this, we propose a sliding-window classification framework based on multi-label support vector machines (SVMs), integrating local contextual modeling with window-level voting to accurately localize co-occurring language features. Evaluated on the IBM Project CodeNet dataset, our approach achieves an average F1-score of 0.90 on multi-topic classification and 0.75 on code topic highlighting. Our contributions are threefold: (1) the first multi-label code analysis pipeline explicitly designed for fine-grained localization of core language concepts; (2) a lightweight, language-agnostic, and reusable framework; and (3) practical applicability to technical decision-making, onboarding of new developers, and development of intelligent programming tools.

📝 Abstract
As software systems grow in scale and complexity, understanding the distribution of programming language topics within source code becomes increasingly important for guiding technical decisions, improving onboarding, and informing tooling and education. This paper presents the design, implementation, and evaluation of a novel programming language topic classification workflow. Our approach combines a multi-label Support Vector Machine (SVM) with a sliding window and voting strategy to enable fine-grained localization of core language concepts such as operator overloading, virtual functions, inheritance, and templates. Trained on the IBM Project CodeNet dataset, our model achieves an average F1 score of 0.90 across topics and 0.75 on code-topic highlighting. Our findings contribute empirical insights and a reusable pipeline for researchers and practitioners interested in code analysis and data-driven software engineering.
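To make the "average F1 score across topics" concrete, the following is a minimal sketch of one common way to compute it for multi-label predictions: per-topic F1, then an unweighted (macro) mean. The abstract does not state which averaging scheme the authors used, so the macro average here is an assumption, and the function names `f1_per_topic` and `macro_f1` are hypothetical.

```python
from typing import Dict, List, Set


def f1_per_topic(
    y_true: List[Set[str]], y_pred: List[Set[str]], topics: List[str]
) -> Dict[str, float]:
    """Compute F1 separately for each topic over a set of code samples."""
    scores: Dict[str, float] = {}
    for topic in topics:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if topic in t and topic in p)
        fp = sum(1 for t, p in pairs if topic not in t and topic in p)
        fn = sum(1 for t, p in pairs if topic in t and topic not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the topic never occurs.
        scores[topic] = (2 * tp / denom) if denom else 0.0
    return scores


def macro_f1(
    y_true: List[Set[str]], y_pred: List[Set[str]], topics: List[str]
) -> float:
    """Unweighted mean of per-topic F1 (assumed averaging scheme)."""
    scores = f1_per_topic(y_true, y_pred, topics)
    return sum(scores.values()) / len(scores)


# Toy data, invented for illustration only (not from the paper).
y_true = [{"templates"}, {"inheritance", "templates"}, {"inheritance"}]
y_pred = [{"templates"}, {"templates"}, {"inheritance", "templates"}]
topics = ["inheritance", "templates"]
print(f1_per_topic(y_true, y_pred, topics))
print(macro_f1(y_true, y_pred, topics))
```

A macro average weights every topic equally, so a rare topic such as operator overloading counts as much as a common one. A micro or sample-weighted average would give different numbers on imbalanced data.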
Problem

Research questions and friction points this paper is trying to address.

Classifying programming language topics in source code
Localizing core language concepts like inheritance and templates
Providing empirical insights for code analysis and software engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-label SVM with sliding window
Voting strategy for fine-grained localization
Trained on IBM Project CodeNet dataset
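The sliding-window and voting idea listed above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the window size, the vote threshold `min_share`, and the function `highlight_topics` are hypothetical, and `keyword_classifier` is a trivial stand-in for the trained multi-label SVM so that the example runs without external libraries.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# Any callable mapping a code snippet to a set of topic labels.
# In the paper this role is played by a multi-label SVM.
Classifier = Callable[[str], Set[str]]


def keyword_classifier(snippet: str) -> Set[str]:
    """Hypothetical stand-in for the SVM: labels a snippet by keyword."""
    labels: Set[str] = set()
    if "template" in snippet:
        labels.add("templates")
    if "virtual" in snippet:
        labels.add("virtual functions")
    if "operator" in snippet:
        labels.add("operator overloading")
    return labels


def highlight_topics(
    lines: List[str], classify: Classifier, window: int = 3, min_share: float = 0.5
) -> List[Set[str]]:
    """Slide a window over the lines, classify each window, and let every
    window vote for its labels on each line it covers. A line keeps a label
    when at least `min_share` of its covering windows voted for it."""
    votes: List[Dict[str, int]] = [defaultdict(int) for _ in lines]
    covered = [0] * len(lines)
    last_start = max(len(lines) - window, 0)
    for start in range(last_start + 1):
        end = min(start + window, len(lines))
        labels = classify("\n".join(lines[start:end]))
        for i in range(start, end):
            covered[i] += 1
            for label in labels:
                votes[i][label] += 1
    return [
        {label for label, n in votes[i].items() if n / covered[i] >= min_share}
        for i in range(len(lines))
    ]


code = [
    "template <typename T>",
    "T add(T a, T b) {",
    "  return a + b;",
    "}",
    "struct Shape {",
    "  virtual double area() const = 0;",
    "};",
]
for line, labels in zip(code, highlight_topics(code, keyword_classifier)):
    print(f"{sorted(labels)!s:25} {line}")
```

Voting smooths window-level predictions into per-line highlights: a line near a topic boundary is covered by several windows, and it is only highlighted when enough of them agree. A larger window gives the classifier more context but blurs the boundaries, which is the trade-off a real pipeline would need to tune.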
Michael Zhang
Queen’s University, Kingston, Canada
Yuan Tian
Queen’s University, Kingston, Canada
Mariam Guizani
Queen's University
Software Engineering, HCI, Empirical Studies, Open Source