Evaluation Guidelines for Empirical Studies in Software Engineering involving LLMs

📅 2025-08-21
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Large language models (LLMs) pose serious reproducibility and replicability challenges for empirical software engineering research due to their non-determinism, opaque training data, and rapidly evolving architectures. Method: The authors develop a framework grounded in a literature review and iterative expert consensus. It comprises a taxonomy of LLM-based study types and a dual-tier ("must" and "should") set of empirical design and reporting guidelines covering model versions, prompt engineering, interaction logs, baseline configurations, and more. The study types and guidelines are also maintained as an open, community-curated, continuously updated online resource. Contribution/Results: The framework offers a standardized evaluation and reporting paradigm for LLM-driven software engineering research, aiming to improve experimental transparency, cross-study comparability, and open science practices, and thereby to enable rigorous, accountable, and cumulative scientific progress.

📝 Abstract
Large language models (LLMs) are increasingly being integrated into software engineering (SE) research and practice, yet their non-determinism, opaque training data, and evolving architectures complicate the reproduction and replication of empirical studies. We present a community effort to scope this space, introducing a taxonomy of LLM-based study types together with eight guidelines for designing and reporting empirical studies involving LLMs. The guidelines present essential (must) criteria as well as desired (should) criteria and target transparency throughout the research process. Our recommendations, contextualized by our study types, are: (1) to declare LLM usage and role; (2) to report model versions, configurations, and fine-tuning; (3) to document tool architectures; (4) to disclose prompts and interaction logs; (5) to use human validation; (6) to employ an open LLM as a baseline; (7) to report suitable baselines, benchmarks, and metrics; and (8) to openly articulate limitations and mitigations. Our goal is to enable reproducibility and replicability despite LLM-specific barriers to open science. We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines.org).
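Guidelines (2) and (4) above call for reporting model versions and configurations and for disclosing prompts and interaction logs. A minimal logging sketch of what this can look like in practice, assuming the OpenAI Python client; the model name, seed, and JSONL log format are illustrative choices, not something the guidelines prescribe:

```python
# Minimal interaction-logging sketch (assumes the OpenAI Python client >= 1.0;
# the guidelines themselves are tool-agnostic).
import json
import datetime
from openai import OpenAI

client = OpenAI()

def query_and_log(prompt: str, logfile: str = "interactions.jsonl") -> str:
    """Send one prompt and append the full interaction record to a JSONL log."""
    # Pinned model version plus the full decoding configuration (guideline 2).
    config = {"model": "gpt-4o-2024-08-06", "temperature": 0.0, "seed": 42}
    resp = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}], **config
    )
    # Persist prompt, response, and configuration verbatim (guideline 4).
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "config": config,
        "system_fingerprint": resp.system_fingerprint,  # backend build actually used
        "prompt": prompt,
        "response": resp.choices[0].message.content,
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["response"]
```

Archiving the resulting log alongside the study artifacts lets later replications check exactly which model build, parameters, and prompts produced each reported result.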
Problem

Research questions and friction points this paper is trying to address.

Establishing guidelines for reproducible LLM studies in software engineering
Addressing non-determinism and opacity challenges in LLM research
Providing taxonomy and criteria for transparent LLM study reporting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing taxonomy for LLM-based study types
Providing eight guidelines for empirical study design
Targeting transparency throughout the research process
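Guideline (6) in the abstract recommends employing an open LLM as a baseline so that results remain verifiable after proprietary models change or are retired. A minimal sketch of such a baseline run, assuming the Hugging Face transformers library and an illustrative model choice (the guidelines do not prescribe specific tooling):

```python
# Sketch of an open-LLM baseline run; model and revision are placeholders.
from transformers import pipeline, set_seed

set_seed(42)  # fix RNG state for repeatable runs

pipe = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # any openly released model
    revision="main",  # pin a specific commit hash here for exact reproducibility
)

out = pipe(
    "Summarize the following bug report in one sentence: ...",
    max_new_tokens=64,
    do_sample=False,  # greedy decoding removes sampling non-determinism
)
print(out[0]["generated_text"])
```

Pinning `revision` to a concrete commit hash rather than `main` makes the exact weights recoverable later, which is what distinguishes an open baseline from an API-only one.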
👥 Authors
Sebastian Baltes (University of Bayreuth): software engineering, empirical software engineering
Florian Angermeir (fortiss, Germany and BTH, Sweden)
Chetan Arora (Monash University, Australia)
Marvin Muñoz Barón (TU Munich, Germany)
Chunyang Chen (Professor, Department of Computer Science, Technical University of Munich): Software Engineering, Deep Learning, Human Computer Interaction, LLM4SE, GUI
Lukas Böhme (Hasso-Plattner-Institut, Germany and University of Potsdam, Germany)
Fabio Calefato (Associate Professor, University of Bari): Human Factors in SW Eng, SE4AI, Mining Software Repositories, Personality + Sentiment Analysis, Online Communities
Neil Ernst (University of Victoria, Canada)
Davide Falessi (University of Rome "Tor Vergata"): Software Engineering
Brian Fitzgerald (Lero - the Irish Software Research Centre, University of Limerick): Agile methods, open source software, information systems development, inner source, DevOps
Davide Fucci (Software Engineering Research and Education Lab, Blekinge Institute of Technology): Empirical software engineering
Marcos Kalinowski (Professor, Pontifical Catholic University of Rio de Janeiro (PUC-Rio)): Empirical Software Engineering, AI Engineering, AI4SE, Human Aspects in Software Engineering
Stefano Lambiase (Assistant Professor in Software Engineering, Aalborg University in Copenhagen, Denmark): Software Engineering, Video Games Development
Daniel Russo (Aalborg University, Denmark)
Mircea Lungu (IT University Copenhagen, Denmark)
Lutz Prechelt (Professor of Informatics, Freie Universität Berlin): Software engineering, empirical software engineering, agile methods, pair programming
Paul Ralph (Professor of Computer Science, Dalhousie University): Software Engineering, Research Methods, Sustainable Development, Design, Project Management
Christoph Treude (Associate Professor of Computer Science, Singapore Management University): Software Engineering, Empirical Software Engineering, Human-AI Interaction, AI for Science, AI4SE
Stefan Wagner (TU Munich, Germany)