Towards Evaluation Guidelines for Empirical Studies involving LLMs

📅 2024-11-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Empirical studies of large language models (LLMs) in software engineering lack systematic evaluation norms across the research lifecycle. Method: This paper proposes the first comprehensive, end-to-end evaluation guideline, covering study design, data curation, tool selection, reproducibility, and ethical considerations, built by integrating established empirical software engineering paradigms, modeling of LLM-specific characteristics, and iterative community consensus building. Contribution/Results: It introduces the first structured evaluation standard for LLM-focused empirical research, explicitly defining critical quality dimensions and practical requirements for each phase; establishes a shared benchmark for high-quality LLM research in the software engineering community; and provides foundational support for future study design, peer review, and replication. The guidelines have been validated in representative software engineering research scenarios, confirming their applicability and practical utility.

📝 Abstract
In the short period since the release of ChatGPT, large language models (LLMs) have changed the software engineering research landscape. While there are numerous opportunities to use LLMs for supporting research or software engineering tasks, solid science needs rigorous empirical evaluations. However, so far, there are no specific guidelines for conducting and assessing studies involving LLMs in software engineering research. Our focus is on empirical studies that either use LLMs as part of the research process or studies that evaluate existing or new tools that are based on LLMs. This paper contributes the first set of holistic guidelines for such studies. Our goal is to start a discussion in the software engineering research community to reach a common understanding of our standards for high-quality empirical studies involving LLMs.
Problem

Research questions and friction points this paper is trying to address.

Lack of LLM evaluation guidelines
Need for rigorous empirical studies
Establishing standards for LLM research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developing LLM evaluation guidelines
Focusing on empirical study rigor
Establishing holistic research standards
Stefan Wagner
TUM School of Communication, Information and Technology, Technical University of Munich, Heilbronn, Germany
Marvin Muñoz Barón
TUM School of Communication, Information and Technology, Technical University of Munich, Heilbronn, Germany
Davide Falessi
University of Rome "Tor Vergata"
Software Engineering
Sebastian Baltes
University of Bayreuth
software engineering, empirical software engineering