🤖 AI Summary
This study systematically evaluates the automated software development capabilities of Llama 2-70B across multiple programming languages (Python, Fortran, C, etc.) in scientific computing contexts, focusing on four tasks: code generation, documentation writing, unit test generation, and code translation. To address the lack of domain-specific benchmarks, we propose the first multidimensional evaluation framework tailored to canonical scientific computing use cases—such as numerical integration and parallel matrix operations—integrating static analysis, dynamic execution validation, functional correctness checking, and human-led quality assessment. Results show that the model reliably generates executable code for simple numerical tasks but exhibits a marked decline in functional correctness for parallel/distributed computations, necessitating extensive manual intervention. Critical bottlenecks are identified in memory management and semantic modeling of MPI/OpenMP constructs. We publicly release a fully reproducible benchmark suite and provide targeted recommendations for improving large language models’ scientific computing proficiency.
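The evaluation pipeline described above (dynamic execution validation plus functional correctness checking) can be sketched minimally as follows. This is an illustrative reconstruction, not the paper's actual harness: the `generated_code` string stands in for real LLM output, and `check_functional_correctness` is a hypothetical helper that runs the generated snippet and compares its result to a reference value within a numerical tolerance.

```python
import math

# Stand-in for model-generated output on a canonical task
# (numerical integration via the composite trapezoidal rule).
generated_code = """
def integrate(f, a, b, n=10_000):
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h
"""

def check_functional_correctness(src, reference, tol=1e-6):
    """Execute generated code and compare its result to a reference value."""
    ns = {}
    try:
        exec(src, ns)                            # dynamic execution validation
        result = ns["integrate"](math.sin, 0.0, math.pi)
    except Exception:
        return False                             # any runtime failure counts as incorrect
    return abs(result - reference) < tol         # functional correctness check

# Reference: the integral of sin(x) over [0, pi] is exactly 2.
print(check_functional_correctness(generated_code, 2.0))  # → True
```

A real harness would additionally compile and run code in the other target languages (Fortran, C) in a sandboxed subprocess rather than via `exec`, and would aggregate pass rates across many test problems.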
📝 Abstract
The rapid evolution of large language models (LLMs) has opened new possibilities for automating various tasks in software development. This paper evaluates the capabilities of the Llama 2-70B model in automating these tasks for scientific applications written in commonly used programming languages. Using representative test problems, we assess the model's capacity to generate code, documentation, and unit tests, as well as its ability to translate existing code between these languages. Our comprehensive analysis evaluates the compilation, runtime behavior, and correctness of the generated and translated code. We also assess the quality of the automatically generated code, documentation, and unit tests. Our results indicate that while Llama 2-70B frequently generates syntactically correct and functional code for simpler numerical tasks, it encounters substantial difficulties with more complex, parallelized, or distributed computations, requiring considerable manual corrections. We identify key limitations and suggest areas for future improvement to better leverage AI-driven automation in scientific computing workflows.