Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3

📅 2025-04-22
🤖 AI Summary
Problem: Existing evaluations of large language models (LLMs) for multilingual code smell detection lack cross-language standardized benchmarks and joint cost–performance analysis. Method: We construct the first annotated, multilingual dataset covering Java, Python, JavaScript, and C++, propose a three-tier evaluation matrix (overall/category/specific-smell level), and integrate F1-score (macro-averaged), recall, and token-level inference cost modeling, comparing GPT-4 and DeepSeek-V3 against SonarQube as a static-analysis baseline. Contribution/Results: This work introduces the first cross-language LLM evaluation framework for code smells; designs a fine-grained, three-tier assessment methodology; and pioneers joint quantification of detection performance and inference cost. Results show GPT-4 achieves a 12.3-percentage-point higher macro-F1 than DeepSeek-V3, yet the latter incurs 68% lower token cost. Both LLMs significantly outperform SonarQube on complex logic-related smells (e.g., Feature Envy), demonstrating superior contextual reasoning capabilities.

📝 Abstract
Determining the most effective large language model (LLM) for code smell detection is a complex challenge. This study introduces a structured methodology and evaluation matrix to address it, using a curated dataset of code samples consistently annotated with known smells. The dataset spans four prominent programming languages (Java, Python, JavaScript, and C++), allowing cross-language comparison. We benchmark two state-of-the-art LLMs, OpenAI GPT-4.0 and DeepSeek-V3, using precision, recall, and F1 score as evaluation metrics. Our analysis covers three levels of detail: overall performance, category-level performance, and individual code smell type performance. Additionally, we explore cost-effectiveness by comparing the token-based detection approach of GPT-4.0 with the pattern-matching techniques employed by DeepSeek-V3. The study also includes a cost analysis relative to traditional static analysis tools such as SonarQube. The findings offer practitioners guidance in selecting an efficient, cost-effective solution for automated code smell detection.
Problem

Research questions and friction points this paper is trying to address.

Compare LLMs for detecting code smells across languages
Evaluate GPT-4.0 vs DeepSeek-V3 using precision, recall, and F1 score
Analyze cost-effectiveness vs traditional static analysis tools
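
To make the evaluation concrete, the following is a minimal sketch of how precision, recall, and F1 could be computed at the three levels the abstract names (overall, category, individual smell). It is not the authors' implementation: the function names, the input format (pairs of annotated and predicted smell sets), and the smell-to-category mapping are all assumptions chosen for illustration.

```python
from collections import defaultdict


def prf(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def evaluate(samples, categories):
    """Score predictions at three levels.

    samples:    iterable of (annotated_smells, predicted_smells) set pairs,
                one pair per code sample (hypothetical input format).
    categories: dict mapping smell name -> category name (hypothetical).
    Returns (overall, per_category, per_smell), each holding (P, R, F1).
    """
    counts = defaultdict(lambda: [0, 0, 0])  # smell -> [tp, fp, fn]
    for truth, predicted in samples:
        for smell in truth & predicted:
            counts[smell][0] += 1  # correctly detected
        for smell in predicted - truth:
            counts[smell][1] += 1  # reported but not annotated
        for smell in truth - predicted:
            counts[smell][2] += 1  # annotated but missed

    per_smell = {smell: prf(*c) for smell, c in counts.items()}

    category_counts = defaultdict(lambda: [0, 0, 0])
    for smell, c in counts.items():
        category = categories.get(smell, "uncategorized")
        for i in range(3):
            category_counts[category][i] += c[i]
    per_category = {cat: prf(*c) for cat, c in category_counts.items()}

    # Overall score pools the counts across all smells (micro-averaging).
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    return prf(*totals), per_category, per_smell


# Illustrative data only; these are not results from the paper.
samples = [
    ({"Long Method", "Feature Envy"}, {"Long Method"}),
    ({"God Class"}, {"God Class", "Long Method"}),
]
categories = {
    "Long Method": "bloater",
    "God Class": "bloater",
    "Feature Envy": "coupler",
}
overall, per_category, per_smell = evaluate(samples, categories)
```

The overall score here pools counts across smells; averaging the per-smell F1 values instead would give a macro-averaged score, which weights rare smells as heavily as common ones.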
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured methodology for LLM evaluation
Cross-language dataset with annotated smells
Token vs pattern-matching cost analysis
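
As a rough illustration of the token-based side of the cost analysis, the sketch below estimates the cost of an LLM request from its token counts and relates total cost to the number of correctly detected smells. The `Pricing` class, both function names, and the prices in the usage example are hypothetical placeholders, not figures from the paper or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD (placeholder values, supplied by the caller)."""
    usd_per_million_input: float
    usd_per_million_output: float


def request_cost(pricing, input_tokens, output_tokens):
    """Cost in USD of one request, given prompt and completion token counts."""
    return (
        input_tokens * pricing.usd_per_million_input
        + output_tokens * pricing.usd_per_million_output
    ) / 1_000_000


def cost_per_true_positive(total_cost, true_positives):
    """Cost per correctly detected smell; infinite if nothing was detected."""
    return total_cost / true_positives if true_positives else float("inf")


# Placeholder prices and token counts for illustration only.
example_pricing = Pricing(usd_per_million_input=2.0, usd_per_million_output=8.0)
example_cost = request_cost(example_pricing, input_tokens=1_000_000, output_tokens=500_000)
example_unit_cost = cost_per_true_positive(example_cost, true_positives=3)
```

Dividing cost by true positives is one possible way to put detection quality and spend on a single scale; other normalizations, such as cost per analyzed file, would serve equally well depending on the question being asked.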
Ahmed R. Sadik
Honda Research Institute - EU
Siddhata Govind
Honda Research Institute Europe, Offenbach am Main, Germany