🤖 AI Summary
This study systematically evaluates large language models' (LLMs) ability to invoke unfamiliar, domain-specific Python libraries (e.g., ParShift, pyclugen) for complex scientific computing tasks under zero-shot conditions, focusing on the reliability of generating executable and functionally correct code. We propose a structured zero-shot prompting framework, integrated with multi-round quantitative execution validation, functional correctness assessment, and error-pattern analysis; conversational data analysis and synthetic-data clustering serve as complementary evaluation dimensions. To our knowledge, this is the first benchmarking effort targeting LLMs' capability to interface with third-party scientific computing libraries. Our evaluation reveals that documentation gaps and implementation flaws in the target libraries significantly impede code generation. Results show that only GPT-4.1 achieves consistent success across all tasks, substantially outperforming every other model evaluated. This work establishes a novel benchmark, methodology, and empirical foundation for trustworthy LLM-driven code generation in scientific automation.
📝 Abstract
Large Language Models (LLMs) have advanced rapidly as tools for automating code generation in scientific research, yet their ability to interpret and use unfamiliar Python APIs for complex computational experiments remains poorly characterized. This study systematically benchmarks a selection of state-of-the-art LLMs in generating functional Python code for two increasingly challenging scenarios: conversational data analysis with the *ParShift* library, and synthetic data generation and clustering using *pyclugen* and *scikit-learn*. Both experiments use structured, zero-shot prompts specifying detailed requirements but omitting in-context examples. Model outputs are evaluated quantitatively for functional correctness and prompt compliance over multiple runs, and qualitatively by analyzing the errors produced when code execution fails. Results show that only a small subset of models consistently generate correct, executable code, with GPT-4.1 standing out as the only model to always succeed in both tasks. In addition to benchmarking LLM performance, this approach helps identify shortcomings in third-party libraries, such as unclear documentation or obscure implementation bugs. Overall, these findings highlight current limitations of LLMs for end-to-end scientific automation and emphasize the need for careful prompt design, comprehensive library documentation, and continued advances in language model capabilities.
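To make the prompting setup concrete, the sketch below shows what a structured zero-shot prompt of the kind described (task statement plus explicit requirements, deliberately without in-context examples) might look like. This is an illustrative assumption, not the study's actual prompt: the section names, wording, and the `build_zero_shot_prompt` helper are hypothetical.

```python
# Hypothetical sketch of a structured zero-shot prompt builder.
# The actual prompts used in the study are not reproduced here.

def build_zero_shot_prompt(task: str, library: str, requirements: list[str]) -> str:
    """Assemble a structured prompt: task statement, target library,
    and numbered explicit requirements, with no in-context examples."""
    lines = [
        f"Task: {task}",
        f"Target library: {library}",
        "Requirements:",
    ]
    # Number each requirement so compliance can be checked per item.
    lines += [f"  {i}. {req}" for i, req in enumerate(requirements, start=1)]
    lines.append("Return only executable Python code, with no explanations.")
    return "\n".join(lines)

prompt = build_zero_shot_prompt(
    task="Generate synthetic 2-D clustered data and cluster it.",
    library="pyclugen + scikit-learn",
    requirements=[
        "Generate 4 clusters of 500 points each with pyclugen.",
        "Cluster the points with scikit-learn's KMeans (k=4).",
        "Report the adjusted Rand index against the true labels.",
    ],
)
print(prompt)
```

Keeping the requirements as a numbered list makes the later prompt-compliance check straightforward: each generated script can be scored against the individual requirement items rather than the prompt as a whole.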