🤖 AI Summary
Identifying fine-grained distributions of programming language concepts—such as operator overloading, virtual functions, inheritance, and templates—in large-scale software systems remains challenging due to their sparse, context-dependent manifestations in source code. To address this, we propose a sliding-window classification framework based on multi-label support vector machines (SVMs), integrating local contextual modeling with window-level voting to accurately localize co-occurring language features. Evaluated on the IBM Project CodeNet dataset, our approach achieves an average F1-score of 0.90 on multi-topic classification and 0.75 on code topic highlighting. Our contributions are threefold: (1) the first multi-label code analysis pipeline explicitly designed for fine-grained localization of core language concepts; (2) a lightweight, language-agnostic, and reusable framework; and (3) practical applicability to technical decision-making, onboarding of new developers, and development of intelligent programming tools.
📝 Abstract
As software systems grow in scale and complexity, understanding the distribution of programming language topics within source code becomes increasingly important for guiding technical decisions, improving onboarding, and informing tooling and education. This paper presents the design, implementation, and evaluation of a novel programming language topic classification workflow. Our approach combines a multi-label Support Vector Machine (SVM) with a sliding-window and voting strategy to enable fine-grained localization of core language concepts such as operator overloading, virtual functions, inheritance, and templates. Trained on the IBM Project CodeNet dataset, our model achieves an average F1-score of 0.90 across topics and 0.75 on code-topic highlighting. Our findings contribute empirical insights and a reusable pipeline for researchers and practitioners interested in code analysis and data-driven software engineering.
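The sliding-window-plus-voting idea described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: it uses scikit-learn's `LinearSVC` in a one-vs-rest wrapper over TF-IDF features, and the toy training snippets, window size, and whitespace tokenization are all assumptions made for the example.

```python
# Hedged sketch (NOT the paper's actual pipeline): a multi-label linear SVM
# over TF-IDF features, applied with a sliding window whose per-window
# predictions vote topics onto every line the window covers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy training snippets tagged with language topics (hypothetical data,
# standing in for CodeNet-derived training examples).
train_snippets = [
    "template <typename T> T add(T a, T b) { return a + b; }",
    "virtual void draw() const override;",
    "class Square : public Shape { };",
    "Vec operator+(const Vec& a, const Vec& b);",
]
train_topics = [
    {"templates"},
    {"virtual_functions"},
    {"inheritance"},
    {"operator_overloading"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_topics)
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(train_snippets)
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)

def highlight(lines, window=2):
    """Slide a fixed-size window over a file; each window's predicted
    topic set is voted onto every line that the window covers."""
    votes = [set() for _ in lines]
    for start in range(max(1, len(lines) - window + 1)):
        chunk = " ".join(lines[start:start + window])
        pred = clf.predict(vectorizer.transform([chunk]))
        for topic in mlb.inverse_transform(pred)[0]:
            for i in range(start, min(start + window, len(lines))):
                votes[i].add(topic)
    return votes

file_lines = [
    "class Circle : public Shape {",
    "  virtual void draw() const override;",
    "};",
]
line_topics = highlight(file_lines)  # one topic set per source line
```

A line covered by several windows accumulates their predictions, which is what allows co-occurring concepts (e.g. inheritance and virtual functions in the same class body) to be highlighted on the same lines.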