🤖 AI Summary
This study systematically evaluates the automated software development capabilities of Llama 2-70B across multiple programming languages (Python, Fortran, C, etc.) in scientific computing contexts, focusing on four tasks: code generation, documentation writing, unit test generation, and code translation. To address the lack of domain-specific benchmarks, we propose the first multidimensional evaluation framework tailored to canonical scientific computing use cases—such as numerical integration and parallel matrix operations—integrating static analysis, dynamic execution validation, functional correctness checking, and human-led quality assessment. Results show that the model reliably generates executable code for simple numerical tasks but exhibits a marked decline in functional correctness for parallel/distributed computations, necessitating extensive manual intervention. Critical bottlenecks are identified in memory management and semantic modeling of MPI/OpenMP constructs. We publicly release a fully reproducible benchmark suite and provide targeted recommendations for improving large language models’ scientific computing proficiency.
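The evaluation pipeline described above (dynamic execution validation plus functional correctness checking) can be sketched minimally as follows. This is an illustrative reconstruction, not the paper's actual harness: the `generated_code` string stands in for real LLM output, and `check_functional_correctness` is a hypothetical helper that runs the generated snippet and compares its result to a reference value within a numerical tolerance.

```python
import math

# Stand-in for model-generated output on a canonical task
# (numerical integration via the composite trapezoidal rule).
generated_code = """
def integrate(f, a, b, n=10_000):
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h
"""

def check_functional_correctness(src, reference, tol=1e-6):
    """Execute generated code and compare its result to a reference value."""
    ns = {}
    try:
        exec(src, ns)                            # dynamic execution validation
        result = ns["integrate"](math.sin, 0.0, math.pi)
    except Exception:
        return False                             # any runtime failure counts as incorrect
    return abs(result - reference) < tol         # functional correctness check

# Reference: the integral of sin(x) over [0, pi] is exactly 2.
print(check_functional_correctness(generated_code, 2.0))  # → True
```

A real harness would additionally compile and run code in the other target languages (Fortran, C) in a sandboxed subprocess rather than via `exec`, and would aggregate pass rates across many test problems.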
📝 Abstract
The rapid evolution of large language models (LLMs) has opened new possibilities for automating various tasks in software development. This paper evaluates the capabilities of the Llama 2-70B model in automating these tasks for scientific applications written in commonly used programming languages. Using representative test problems, we assess the model's capacity to generate code, documentation, and unit tests, as well as its ability to translate existing code between these languages. Our comprehensive analysis evaluates the compilation, runtime behavior, and correctness of the generated and translated code. We also assess the quality of the automatically generated code, documentation, and unit tests. Our results indicate that while Llama 2-70B frequently generates syntactically correct and functional code for simpler numerical tasks, it encounters substantial difficulties with more complex, parallelized, or distributed computations, requiring considerable manual corrections. We identify key limitations and suggest areas for future improvement to better leverage AI-driven automation in scientific computing workflows.