🤖 AI Summary
This work addresses a critical gap in the evaluation of code-generating agents: existing benchmarks have predominantly focused on functional correctness while neglecting maintainability, particularly an agent's ability to eliminate code smells and improve long-term readability and robustness. To bridge this gap, the authors introduce SmellBench, the first systematic benchmark for assessing code refactoring capabilities through the lens of code smells. Built upon real-world open-source projects, SmellBench programmatically injects seven common code smells to create 294 high-quality, diverse refactoring tasks. The benchmark features a three-dimensional evaluation framework measuring functional correctness, accurate smell localization, and refactoring quality. Experimental results reveal that even the best-performing agent and model combination (Qwen Code + Claude Sonnet 4.5) achieves a smell-elimination score of only 50.34, highlighting significant limitations in cross-file reasoning and holistic refactoring capabilities.
📝 Abstract
Code agents have achieved remarkable advances in recent years, exhibiting strong capabilities across a wide range of software engineering tasks. However, their misuse often produces bloated and disorganized code that impairs readability, extensibility, and robustness. Despite this risk, existing benchmarks for code agents largely evaluate functional correctness rather than long-term maintainability. In this paper, we propose SmellBench, an extensible code refactoring benchmark that proactively injects code smells into clean code snippets from real-world repositories. This design enables the generation of controlled, high-quality, and diverse refactoring cases with human-written ground truth. Specifically, it contains 294 cases spanning 7 popular smell types, 3 difficulty levels, and 2 instruction settings across 7 real-world repositories. We further design 3 evaluation aspects covering functional correctness, localization ability, and refactoring quality assessment. Experiments with 2 popular agents and 6 large language models (LLMs) show that the best combination, Qwen Code + Claude Sonnet 4.5, achieved a smell-elimination score of only 50.34. Further analysis reveals that this gap arises from a focus on local code smells and a lack of cross-file understanding, which hinders comprehensive smell elimination.
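
The reported case count is consistent with one case per combination of the four dimensions (7 × 3 × 2 × 7 = 294). The abstract does not state that the design is fully crossed, so the sketch below is only an arithmetic check under that assumption. All labels (`smell_1`, `repo_1`, and so on) are hypothetical placeholders, since the abstract does not name the smell types, difficulty levels, instruction settings, or repositories.

```python
from itertools import product

# Dimension sizes come from the abstract; the labels are placeholders,
# because the abstract does not name the individual values.
smell_types = [f"smell_{i}" for i in range(1, 8)]             # 7 smell types
difficulty_levels = [f"level_{i}" for i in range(1, 4)]       # 3 difficulty levels
instruction_settings = [f"setting_{i}" for i in range(1, 3)]  # 2 instruction settings
repositories = [f"repo_{i}" for i in range(1, 8)]             # 7 repositories

# Assumption: exactly one case per combination (a fully crossed design).
cases = list(product(smell_types, difficulty_levels,
                     instruction_settings, repositories))

print(len(cases))  # 294, matching the case count reported in the abstract
```

If the actual benchmark is not fully crossed, the total could still be 294 with a different per-dimension distribution; the paper itself is the authority on how cases are allocated.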