Towards Evaluation Guidelines for Empirical Studies involving LLMs

📅 2024-11-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Empirical studies of large language models (LLMs) in software engineering lack systematic evaluation norms across the research lifecycle. Method: This paper proposes the first comprehensive, end-to-end evaluation guideline, covering study design, data curation, tool selection, reproducibility, and ethical considerations, built by integrating established empirical software engineering paradigms, modeling of LLM-specific characteristics, and iterative community consensus building. Contribution/Results: It introduces the first structured evaluation standard for LLM-focused empirical research, explicitly defining critical quality dimensions and practical requirements for each phase; establishes a shared benchmark for high-quality LLM research in the software engineering community; and provides foundational support for future study design, peer review, and replication. The guidelines have been validated in representative software engineering research scenarios, confirming their applicability and practical utility.

📝 Abstract
In the short period since the release of ChatGPT, large language models (LLMs) have changed the software engineering research landscape. While there are numerous opportunities to use LLMs for supporting research or software engineering tasks, solid science needs rigorous empirical evaluations. However, so far, there are no specific guidelines for conducting and assessing studies involving LLMs in software engineering research. Our focus is on empirical studies that either use LLMs as part of the research process or studies that evaluate existing or new tools that are based on LLMs. This paper contributes the first set of holistic guidelines for such studies. Our goal is to start a discussion in the software engineering research community to reach a common understanding of our standards for high-quality empirical studies involving LLMs.
Problem

Research questions and friction points this paper is trying to address.

Lack of LLM evaluation guidelines
Need for rigorous empirical studies
Establishing standards for LLM research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developing LLM evaluation guidelines
Focusing on empirical study rigor
Establishing holistic research standards
Stefan Wagner
TUM School of Communication, Information and Technology, Technical University of Munich, Heilbronn, Germany
Marvin Muñoz Barón
TUM School of Communication, Information and Technology, Technical University of Munich, Heilbronn, Germany
Davide Falessi
University of Rome "Tor Vergata"
Software Engineering
Sebastian Baltes
University of Bayreuth
software engineering, empirical software engineering