🤖 AI Summary
Extract Method refactoring (EMR) improves code readability and maintainability, yet in practice it remains largely manual and human-dependent. This paper applies Recursive Criticism and Improvement (RCI), an iterative prompting strategy in which the model critiques and then refines its own output, to enable end-to-end automated EMR for Python code using open-source code large language models (e.g., DeepSeek-Coder, Qwen2.5-Coder) in the 3B–8B parameter range. Evaluation combines automated metrics (lines of code, cyclomatic complexity, test pass rate) with developer surveys. RCI consistently outperforms one-shot prompting: average method length drops from 12.1 to as low as 5.6 lines; cyclomatic complexity falls from 4.6 to 3.3; the best test pass rate reaches 82.9%; and developer acceptance exceeds 70%. Crucially, the study reveals systematic discrepancies between traditional static metrics and human judgment, challenging conventional LLM evaluation paradigms for code refactoring and providing empirical grounding for more human-aligned assessment frameworks.
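To make the strategy concrete, here is a minimal sketch of an RCI loop as described above: the model first produces a refactoring, then critiques it, then improves it, for a fixed number of rounds. The `generate` callable and the prompt wording are hypothetical placeholders for whatever chat-completion interface the underlying model exposes; the paper's actual prompts are not reproduced here.

```python
def rci_refactor(source: str, generate, max_rounds: int = 3) -> str:
    """Sketch of Recursive Criticism and Improvement (RCI) for Extract Method.

    `generate` is a hypothetical stand-in for a call to an open-source code
    LLM (e.g., DeepSeek-Coder or Qwen2.5-Coder) that maps a prompt string to
    a completion string.
    """
    # Initial one-shot attempt at the refactoring.
    candidate = generate(
        f"Apply Extract Method refactoring to this Python code:\n{source}"
    )
    for _ in range(max_rounds):
        # Criticism step: ask the model to find concrete problems.
        critique = generate(
            "Review the following refactoring for correctness, readability, "
            f"and maintainability. List concrete problems:\n{candidate}"
        )
        # Improvement step: revise the candidate using its own critique.
        candidate = generate(
            f"Original code:\n{source}\n\n"
            f"Current refactoring:\n{candidate}\n\n"
            f"Critique:\n{critique}\n\n"
            "Produce an improved refactoring that addresses the critique."
        )
    return candidate
```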
📝 Abstract
Automating Extract Method refactoring (EMR) remains challenging, and the task is still largely manual despite its importance for code readability and maintainability. Recent advances in open-source, resource-efficient Large Language Models (LLMs) offer promising new approaches to automating such high-level tasks. In this work, we critically evaluate five state-of-the-art open-source LLMs, spanning 3B to 8B parameters, on the EMR task for Python code. We systematically assess functional correctness and code quality using automated metrics, and we investigate the impact of prompting strategies by comparing one-shot prompting to a Recursive Criticism and Improvement (RCI) approach. RCI-based prompting consistently outperforms one-shot prompting in test pass rates and refactoring quality. The best-performing configurations, DeepSeek-Coder and Qwen2.5-Coder with RCI, achieve test pass percentage (TPP) scores of 0.829 and 0.808 while reducing lines of code (LOC) per method from 12.103 to 6.192 and 5.577, and cyclomatic complexity (CC) from 4.602 to 3.453 and 3.294, respectively. A developer survey of RCI-generated refactorings shows over 70% acceptance, with Qwen2.5-Coder rated highest across all evaluation criteria. In contrast, the original code scored below neutral, particularly on readability and maintainability, underscoring the benefits of automated refactoring guided by quality-focused prompts. While traditional metrics such as CC and LOC provide useful signals, they often diverge from human judgment, emphasizing the need for human-in-the-loop evaluation. Our open-source benchmark offers a foundation for future research on automated refactoring with LLMs.
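For reference, the automated quality signals cited above (LOC per method and cyclomatic complexity) can be approximated as follows. This is a hedged sketch, not the paper's exact pipeline: it assumes the `radon` package for CC and averages over all function definitions found by `ast`; the paper's precise aggregation may differ, and TPP would additionally require running each sample's test suite and taking the fraction of passing tests.

```python
import ast
from statistics import mean

from radon.complexity import cc_visit  # pip install radon


def avg_method_loc(source: str) -> float:
    """Mean number of source lines spanned by each (async) function def."""
    tree = ast.parse(source)
    spans = [
        node.end_lineno - node.lineno + 1
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return mean(spans) if spans else 0.0


def avg_cyclomatic_complexity(source: str) -> float:
    """Mean cyclomatic complexity over all blocks radon identifies."""
    blocks = cc_visit(source)
    return mean(b.complexity for b in blocks) if blocks else 0.0
```

Running both functions on a file before and after refactoring yields the kind of LOC and CC deltas reported above (e.g., 12.103 to 5.577 LOC per method).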