Evaluation Guidelines for Empirical Studies in Software Engineering involving LLMs

📅 2025-08-21
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Large language models (LLMs) pose serious reproducibility and replicability challenges for empirical software engineering research due to their non-determinism, opaque training data, and rapidly evolving architectures. Method: The authors develop a framework grounded in a literature review and iterative expert consensus. It comprises a taxonomy of LLM-based study types and a dual-tier ("must" and "should") set of empirical design and reporting guidelines covering model versions, prompt engineering, interaction logs, baseline configurations, and more. The study types and guidelines are also maintained as an open, community-curated, continuously updated online resource. Contribution/Results: The framework offers a standardized evaluation and reporting paradigm for LLM-driven software engineering research, aiming to improve experimental transparency, cross-study comparability, and open science practices, and thereby to enable rigorous, accountable, and cumulative scientific progress.

📝 Abstract
Large language models (LLMs) are increasingly being integrated into software engineering (SE) research and practice, yet their non-determinism, opaque training data, and evolving architectures complicate the reproduction and replication of empirical studies. We present a community effort to scope this space, introducing a taxonomy of LLM-based study types together with eight guidelines for designing and reporting empirical studies involving LLMs. The guidelines present essential (must) criteria as well as desired (should) criteria and target transparency throughout the research process. Our recommendations, contextualized by our study types, are: (1) to declare LLM usage and role; (2) to report model versions, configurations, and fine-tuning; (3) to document tool architectures; (4) to disclose prompts and interaction logs; (5) to use human validation; (6) to employ an open LLM as a baseline; (7) to report suitable baselines, benchmarks, and metrics; and (8) to openly articulate limitations and mitigations. Our goal is to enable reproducibility and replicability despite LLM-specific barriers to open science. We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines.org).
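Guidelines (2) and (4) above call for reporting model versions and configurations and for disclosing prompts and interaction logs. A minimal logging sketch of what this can look like in practice, assuming the OpenAI Python client; the model name, seed, and JSONL log format are illustrative choices, not something the guidelines prescribe:

```python
# Minimal interaction-logging sketch (assumes the OpenAI Python client >= 1.0;
# the guidelines themselves are tool-agnostic).
import json
import datetime
from openai import OpenAI

client = OpenAI()

def query_and_log(prompt: str, logfile: str = "interactions.jsonl") -> str:
    """Send one prompt and append the full interaction record to a JSONL log."""
    # Pinned model version plus the full decoding configuration (guideline 2).
    config = {"model": "gpt-4o-2024-08-06", "temperature": 0.0, "seed": 42}
    resp = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}], **config
    )
    # Persist prompt, response, and configuration verbatim (guideline 4).
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "config": config,
        "system_fingerprint": resp.system_fingerprint,  # backend build actually used
        "prompt": prompt,
        "response": resp.choices[0].message.content,
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["response"]
```

Archiving the resulting log alongside the study artifacts lets later replications check exactly which model build, parameters, and prompts produced each reported result.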
Problem

Research questions and friction points this paper is trying to address.

Establishing guidelines for reproducible LLM studies in software engineering
Addressing non-determinism and opacity challenges in LLM research
Providing taxonomy and criteria for transparent LLM study reporting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing taxonomy for LLM-based study types
Providing eight guidelines for empirical study design
Targeting transparency throughout the research process
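Guideline (6) in the abstract recommends employing an open LLM as a baseline so that results remain verifiable after proprietary models change or are retired. A minimal sketch of such a baseline run, assuming the Hugging Face transformers library and an illustrative model choice (the guidelines do not prescribe specific tooling):

```python
# Sketch of an open-LLM baseline run; model and revision are placeholders.
from transformers import pipeline, set_seed

set_seed(42)  # fix RNG state for repeatable runs

pipe = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # any openly released model
    revision="main",  # pin a specific commit hash here for exact reproducibility
)

out = pipe(
    "Summarize the following bug report in one sentence: ...",
    max_new_tokens=64,
    do_sample=False,  # greedy decoding removes sampling non-determinism
)
print(out[0]["generated_text"])
```

Pinning `revision` to a concrete commit hash rather than `main` makes the exact weights recoverable later, which is what distinguishes an open baseline from an API-only one.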
👥 Authors
Sebastian Baltes (University of Bayreuth): software engineering, empirical software engineering
Florian Angermeir (fortiss, Germany and BTH, Sweden)
Chetan Arora (Monash University, Australia)
Marvin Muñoz Barón (TU Munich, Germany)
Chunyang Chen (Professor, Department of Computer Science, Technical University of Munich): Software Engineering, Deep Learning, Human Computer Interaction, LLM4SE, GUI
Lukas Böhme (Hasso-Plattner-Institut, Germany and University of Potsdam, Germany)
Fabio Calefato (Associate Professor, University of Bari): Human Factors in SW Eng, SE4AI, Mining Software Repositories, Personality + Sentiment Analysis, Online Communities
Neil Ernst (University of Victoria, Canada)
Davide Falessi (University of Rome "Tor Vergata"): Software Engineering
Brian Fitzgerald (Lero - the Irish Software Research Centre, University of Limerick): Agile methods, open source software, information systems development, inner source, DevOps
Davide Fucci (Software Engineering Research and Education Lab, Blekinge Institute of Technology): Empirical software engineering
Marcos Kalinowski (Professor, Pontifical Catholic University of Rio de Janeiro (PUC-Rio)): Empirical Software Engineering, AI Engineering, AI4SE, Human Aspects in Software Engineering
Stefano Lambiase (Assistant Professor in Software Engineering, Aalborg University in Copenhagen, Denmark): Software Engineering, Video Games Development
Daniel Russo (Aalborg University, Denmark)
Mircea Lungu (IT University Copenhagen, Denmark)
Lutz Prechelt (Professor of Informatics, Freie Universität Berlin): Software engineering, empirical software engineering, agile methods, pair programming
Paul Ralph (Professor of Computer Science, Dalhousie University): Software Engineering, Research Methods, Sustainable Development, Design, Project Management
Christoph Treude (Associate Professor of Computer Science, Singapore Management University): Software Engineering, Empirical Software Engineering, Human-AI Interaction, AI for Science, AI4SE
Stefan Wagner (TU Munich, Germany)