🤖 AI Summary
Identifying fine-grained distributions of programming language concepts—such as operator overloading, virtual functions, inheritance, and templates—in large-scale software systems remains challenging due to their sparse, context-dependent manifestations in source code. To address this, we propose a sliding-window classification framework based on multi-label support vector machines (SVMs), integrating local contextual modeling with window-level voting to accurately localize co-occurring language features. Evaluated on the IBM Project CodeNet dataset, our approach achieves an average F1-score of 0.90 on multi-topic classification and 0.75 on code topic highlighting. Our contributions are threefold: (1) the first multi-label code analysis pipeline explicitly designed for fine-grained localization of core language concepts; (2) a lightweight, language-agnostic, and reusable framework; and (3) practical applicability to technical decision-making, onboarding of new developers, and development of intelligent programming tools.
📝 Abstract
As software systems grow in scale and complexity, understanding the distribution of programming language topics within source code becomes increasingly important for guiding technical decisions, improving onboarding, and informing tooling and education. This paper presents the design, implementation, and evaluation of a novel programming language topic classification workflow. Our approach combines a multi-label Support Vector Machine (SVM) with a sliding-window and voting strategy to enable fine-grained localization of core language concepts such as operator overloading, virtual functions, inheritance, and templates. Trained on the IBM Project CodeNet dataset, our model achieves an average F1-score of 0.90 across topics and 0.75 on code-topic highlighting. Our findings contribute empirical insights and a reusable pipeline for researchers and practitioners interested in code analysis and data-driven software engineering.
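The sliding-window-plus-voting idea described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: it uses scikit-learn's `LinearSVC` in a one-vs-rest wrapper over TF-IDF features, and the toy training snippets, window size, and whitespace tokenization are all assumptions made for the example.

```python
# Hedged sketch (NOT the paper's actual pipeline): a multi-label linear SVM
# over TF-IDF features, applied with a sliding window whose per-window
# predictions vote topics onto every line the window covers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy training snippets tagged with language topics (hypothetical data,
# standing in for CodeNet-derived training examples).
train_snippets = [
    "template <typename T> T add(T a, T b) { return a + b; }",
    "virtual void draw() const override;",
    "class Square : public Shape { };",
    "Vec operator+(const Vec& a, const Vec& b);",
]
train_topics = [
    {"templates"},
    {"virtual_functions"},
    {"inheritance"},
    {"operator_overloading"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_topics)
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(train_snippets)
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)

def highlight(lines, window=2):
    """Slide a fixed-size window over a file; each window's predicted
    topic set is voted onto every line that the window covers."""
    votes = [set() for _ in lines]
    for start in range(max(1, len(lines) - window + 1)):
        chunk = " ".join(lines[start:start + window])
        pred = clf.predict(vectorizer.transform([chunk]))
        for topic in mlb.inverse_transform(pred)[0]:
            for i in range(start, min(start + window, len(lines))):
                votes[i].add(topic)
    return votes

file_lines = [
    "class Circle : public Shape {",
    "  virtual void draw() const override;",
    "};",
]
line_topics = highlight(file_lines)  # one topic set per source line
```

A line covered by several windows accumulates their predictions, which is what allows co-occurring concepts (e.g. inheritance and virtual functions in the same class body) to be highlighted on the same lines.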