🤖 AI Summary
A standardized, open benchmark for evaluating the code-generation accuracy of large language models (LLMs) in statistical analysis, particularly for SAS and R code, remains absent, hindering their trustworthy adoption in data science.
Method: We introduce StatLLM, an open-source benchmark dataset for evaluating LLMs on statistical analysis. It pairs a diverse set of statistical analysis tasks with code outputs from state-of-the-art models (ChatGPT 3.5, ChatGPT 4.0, and Llama 3.1) and a five-dimensional expert annotation framework assessing correctness, effectiveness, readability, executability, and output accuracy.
Contribution/Results: StatLLM provides an end-to-end evaluation pipeline that links task specifications, multi-model code generation, and fine-grained human assessment. It fills a gap in statistical LLM benchmarking by enabling (1) evaluation and enhancement of natural language processing metrics, (2) quantitative assessment and improvement of LLM statistical coding proficiency, and (3) development and testing of next-generation statistical software, laying a foundation for reliable AI-driven statistical code generation.
📝 Abstract
The coding capabilities of large language models (LLMs) have opened up new opportunities for automatic statistical analysis in machine learning and data science. However, before their widespread adoption, it is crucial to assess the accuracy of code generated by LLMs. A major challenge in this evaluation lies in the absence of a benchmark dataset for statistical code (e.g., SAS and R). To fill this gap, this paper introduces StatLLM, an open-source dataset for evaluating the performance of LLMs in statistical analysis. The StatLLM dataset comprises three key components: statistical analysis tasks, LLM-generated SAS code, and human evaluation scores. The first component includes statistical analysis tasks spanning a variety of analyses and datasets, providing problem descriptions, dataset details, and human-verified SAS code. The second component features SAS code generated by ChatGPT 3.5, ChatGPT 4.0, and Llama 3.1 for those tasks. The third component contains scores from human experts assessing the correctness, effectiveness, readability, executability, and output accuracy of the LLM-generated code. We also illustrate the unique potential of the established benchmark for (1) evaluating and enhancing natural language processing metrics, (2) assessing and improving LLM performance in statistical coding, and (3) developing and testing next-generation statistical software, advancements that are crucial for data science and machine learning research.
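The first use case above, evaluating an NLP metric against the human scores, can be sketched as follows. This is a minimal illustration, not the paper's methodology: the record schema, the SAS snippets, and the human scores are all hypothetical, and a simple `difflib` string similarity stands in for a real code-similarity metric such as BLEU or CodeBLEU. The idea is to check how well a cheap automatic metric tracks expert judgment by rank-correlating the two.

```python
from difflib import SequenceMatcher

# Hypothetical records mimicking StatLLM's three components: a human-verified
# reference SAS program, an LLM-generated candidate, and an expert score.
# All values here are illustrative, not taken from the actual dataset.
records = [
    {"reference": "proc means data=sashelp.class; var height; run;",
     "generated": "proc means data=sashelp.class; var height weight; run;",
     "human_score": 4.0},
    {"reference": "proc reg data=mydata; model y = x1 x2; run;",
     "generated": "proc glm data=mydata; class g; model y = g; run;",
     "human_score": 1.5},
    {"reference": "proc ttest data=trial; class arm; var outcome; run;",
     "generated": "proc ttest data=trial; class arm; var outcome; run;",
     "human_score": 5.0},
]

def text_similarity(a: str, b: str) -> float:
    """Crude surface-level similarity in [0, 1]; a stand-in for a real metric."""
    return SequenceMatcher(None, a, b).ratio()

def spearman(xs, ys):
    """Spearman rank correlation from scratch (assumes no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# How well does the automatic metric agree with the human ranking?
metric_scores = [text_similarity(r["reference"], r["generated"]) for r in records]
human_scores = [r["human_score"] for r in records]
print(f"rank correlation: {spearman(metric_scores, human_scores):.2f}")
```

A metric whose rank correlation with the human scores is high can serve as a cheap proxy for expert annotation; StatLLM's human scores make exactly this kind of comparison possible across many candidate metrics.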