🤖 AI Summary
Existing evaluations of large language models (LLMs) for multilingual code smell detection lack cross-language standardized benchmarks and joint cost–performance analysis. Method: We construct the first annotated, multilingual dataset covering Java, Python, JavaScript, and C++, propose a three-tier evaluation matrix (overall/category/specific-smell level), and integrate F1-score (macro-averaged), recall, and token-level inference cost modeling—comparing GPT-4 and DeepSeek-V3 against SonarQube as a static-analysis baseline. Contribution/Results: This work introduces the first cross-language LLM evaluation framework for code smells; designs a fine-grained, three-tier assessment methodology; and pioneers joint quantification of detection performance and inference cost. Results show GPT-4 achieves a 12.3-percentage-point higher macro-F1 than DeepSeek-V3, yet the latter incurs 68% lower token cost. Both LLMs significantly outperform SonarQube on complex logic-related smells (e.g., Feature Envy), demonstrating superior contextual reasoning capabilities.
📝 Abstract
Determining the most effective Large Language Model (LLM) for code smell detection is a complex challenge. This study introduces a structured methodology and evaluation matrix to address it, leveraging a curated dataset of code samples consistently annotated with known smells. The dataset spans four prominent programming languages (Java, Python, JavaScript, and C++), allowing for cross-language comparison. We benchmark two state-of-the-art LLMs, OpenAI GPT-4.0 and DeepSeek-V3, using precision, recall, and F1 score as evaluation metrics. Our analysis covers three levels of detail: overall performance, category-level performance, and performance on individual code smell types. Additionally, we explore cost-effectiveness by comparing the token-based detection approach of GPT-4.0 with the pattern-matching techniques employed by DeepSeek-V3. The study also includes a cost analysis relative to traditional static analysis tools such as SonarQube. The findings offer practitioners guidance in selecting an efficient, cost-effective solution for automated code smell detection.
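To make the three levels of analysis concrete, the following is a minimal sketch of how precision, recall, and F1 could be computed per smell, per category, and overall. It is not the authors' implementation: the function names, the sample data, and the smell-to-category mapping are hypothetical, and pooling counts for the category and overall scores (micro-averaging) is an assumption. A macro-averaged variant would instead average the per-smell scores.

```python
from collections import defaultdict


def prf(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts, using 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def evaluate(samples, smell_to_category):
    """Score predictions at three levels.

    samples: iterable of (gold_smells, predicted_smells) pairs, each a set of names.
    smell_to_category: dict mapping a smell name to its category (hypothetical).
    """
    per_smell = defaultdict(lambda: [0, 0, 0])  # [tp, fp, fn] per smell
    for gold, pred in samples:
        for smell in gold & pred:
            per_smell[smell][0] += 1  # correctly detected
        for smell in pred - gold:
            per_smell[smell][1] += 1  # reported but not annotated
        for smell in gold - pred:
            per_smell[smell][2] += 1  # annotated but missed

    # Pool the per-smell counts upward (assumption: micro-averaging).
    per_category = defaultdict(lambda: [0, 0, 0])
    overall = [0, 0, 0]
    for smell, counts in per_smell.items():
        category = smell_to_category.get(smell, "uncategorized")
        for i, count in enumerate(counts):
            per_category[category][i] += count
            overall[i] += count

    return {
        "overall": prf(*overall),
        "category": {c: prf(*v) for c, v in per_category.items()},
        "smell": {s: prf(*v) for s, v in per_smell.items()},
    }


# Illustrative data only; not taken from the study's dataset.
SMELL_TO_CATEGORY = {"Long Method": "Bloaters", "Feature Envy": "Couplers"}
samples = [
    ({"Long Method"}, {"Long Method"}),
    ({"Feature Envy"}, {"Long Method"}),
    ({"Feature Envy"}, {"Feature Envy"}),
]
report = evaluate(samples, SMELL_TO_CATEGORY)
for level in ("overall", "category", "smell"):
    print(level, report[level])
```

Running the same function once per model and per language would give directly comparable scores at each level, which is the kind of comparison the evaluation matrix describes.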