🤖 AI Summary
Poor method naming in scientific Python code, particularly in Jupyter Notebooks, impairs readability and maintainability. Method: This study presents the first systematic evaluation of large language models (LLMs) for improving method name quality in scientific computing, benchmarking GPT-4, Claude-3, Llama-3, and Qwen-2 on 496 manually annotated methods from real-world research code. We quantitatively assess each model's ability to recognize syntactic naming patterns (e.g., verb-initial conventions) and to generate domain-aware, terminology-sensitive renaming suggestions. Results: While all four LLMs consistently adhere to basic naming conventions, they show significant limitations in their understanding of domain-specific terminology and in cross-method naming consistency; suggestion accuracy falls below 60%, so rigorous human review remains necessary. This work establishes the first empirical benchmark and methodology for evaluating LLMs on scientific software engineering tasks, delineating the practical boundaries of AI-assisted code refactoring (particularly for method naming) and offering concrete guidance for integrating LLMs into scientific development workflows.
📝 Abstract
Research scientists increasingly implement their own software to support their work. While previous research has examined the impact of identifier names on program comprehension in traditional programming environments, limited work has explored this area in scientific software, especially regarding the quality of method names. Recent advances in Large Language Models (LLMs) present new opportunities for automating code analysis tasks, such as identifier name appraisal and recommendation. Our study evaluates four popular LLMs on their ability to analyze grammatical patterns and suggest improvements for 496 method names extracted from Python-based Jupyter Notebooks. Our findings show that the LLMs are somewhat effective in analyzing these method names, and their suggestions generally follow good naming practices, such as starting method names with verbs. However, their inconsistent handling of domain-specific terminology and only moderate agreement with human annotations indicate that automated suggestions still require human evaluation. This work provides foundational insights for improving the quality of scientific code through AI automation.
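To make the "verb-initial convention" concrete, here is a minimal Python sketch of the kind of syntactic check the study asks LLMs to perform. The verb list and the function name are illustrative assumptions for this post, not the paper's actual tooling or evaluation pipeline.

```python
# Illustrative heuristic only: a tiny verb lexicon stands in for the
# grammatical-pattern analysis the paper evaluates LLMs on.
COMMON_VERBS = {
    "get", "set", "compute", "calculate", "load", "save", "plot",
    "run", "fit", "train", "predict", "parse", "build", "update",
}

def starts_with_verb(method_name: str) -> bool:
    """Return True if the first snake_case token of the name is a known verb."""
    first_token = method_name.lstrip("_").split("_")[0].lower()
    return first_token in COMMON_VERBS

print(starts_with_verb("compute_gradient"))  # True: verb-initial name
print(starts_with_verb("gradient"))          # False: noun-only name
```

A real analysis would need part-of-speech tagging rather than a fixed word list (e.g., "process" can be a noun or a verb), which is part of why the paper turns to LLMs for this task.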