🤖 AI Summary
Large language models (LLMs) pose serious reproducibility and replicability challenges for empirical software engineering research due to their inherent non-determinism, opaque training data, and rapidly evolving architectures. Method: Through a community effort combining a scoping of the literature with iterative expert consensus, we develop a taxonomy of LLM-based study types and a dual-tier (essential "must" and desired "should") set of eight guidelines for designing and reporting empirical studies, covering declared LLM usage and roles, model versions and configurations, tool architectures, prompts and interaction logs, human validation, open-LLM baselines, suitable benchmarks and metrics, and limitations with mitigations. Contribution/Results: The study types and guidelines are maintained online as an open, living resource for the community to use and shape (llm-guidelines.org). They aim to improve experimental transparency, cross-study comparability, and open science practices, enabling reproducible, replicable, and cumulative LLM-driven software engineering research despite LLM-specific barriers.
📝 Abstract
Large language models (LLMs) are increasingly being integrated into software engineering (SE) research and practice, yet their non-determinism, opaque training data, and evolving architectures complicate the reproduction and replication of empirical studies. We present a community effort to scope this space, introducing a taxonomy of LLM-based study types together with eight guidelines for designing and reporting empirical studies involving LLMs. The guidelines present essential (must) criteria as well as desired (should) criteria and target transparency throughout the research process. Our recommendations, contextualized by our study types, are: (1) to declare LLM usage and role; (2) to report model versions, configurations, and fine-tuning; (3) to document tool architectures; (4) to disclose prompts and interaction logs; (5) to use human validation; (6) to employ an open LLM as a baseline; (7) to report suitable baselines, benchmarks, and metrics; and (8) to openly articulate limitations and mitigations. Our goal is to enable reproducibility and replicability despite LLM-specific barriers to open science. We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines.org).