Living Synthetic Benchmarks: A Neutral and Cumulative Framework for Simulation Studies

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing statistical simulation studies suffer from two major limitations: (1) method developers often design their own data-generating mechanisms (DGMs), introducing evaluation bias; and (2) the lack of standardization across DGMs, benchmark algorithms, and evaluation metrics impedes cross-study comparability and hinders methodological advancement. To address these issues, the authors propose living synthetic benchmarks—a framework that decouples method development from simulation-based evaluation. This open, modular platform features standardized interfaces, version-controlled components, and automated evaluation pipelines. It enables sustainable integration and updating of DGMs, algorithms, and metrics, thereby substantially improving evaluation neutrality, reproducibility, and cross-study comparability. A prototype benchmark for publication bias adjustment methods demonstrates the framework's capacity to support systematic, transparent method comparisons and to accelerate the identification and adoption of effective techniques.

📝 Abstract
Simulation studies are widely used to evaluate statistical methods. However, new methods are often introduced and evaluated using data-generating mechanisms (DGMs) devised by the same authors. This coupling creates misaligned incentives, e.g., the need to demonstrate the superiority of new methods, potentially compromising the neutrality of simulation studies. Furthermore, results of simulation studies are often difficult to compare due to differences in DGMs, competing methods, and performance measures. This fragmentation can lead to conflicting conclusions, hinder methodological progress, and delay the adoption of effective methods. To address these challenges, we introduce the concept of living synthetic benchmarks. The key idea is to disentangle method and simulation study development and continuously update the benchmark whenever a new DGM, method, or performance measure becomes available. This separation benefits the neutrality of method evaluation, emphasizes the development of both methods and DGMs, and enables systematic comparisons. In this paper, we outline a blueprint for building and maintaining such benchmarks, discuss the technical and organizational challenges of implementation, and demonstrate feasibility with a prototype benchmark for publication bias adjustment methods. We conclude that living synthetic benchmarks have the potential to foster neutral, reproducible, and cumulative evaluation of methods, benefiting both method developers and users.
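The abstract's key idea—a registry of versioned DGMs, methods, and performance measures that is re-evaluated whenever a new component arrives—could be prototyped roughly as sketched below. All names here (`LivingBenchmark`, `Component`, `register`, `evaluate`) are hypothetical illustrations; the paper does not prescribe an implementation.

```python
import random
import statistics
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class Component:
    """A versioned benchmark component: a DGM, a method, or a measure."""
    name: str
    version: str
    fn: Callable

class LivingBenchmark:
    """Registry that crosses all registered DGMs, methods, and measures."""
    def __init__(self) -> None:
        self.dgms: List[Component] = []
        self.methods: List[Component] = []
        self.measures: List[Component] = []

    def register(self, kind: str, name: str, version: str, fn: Callable) -> None:
        # kind is one of "dgms", "methods", "measures"
        getattr(self, kind).append(Component(name, version, fn))

    def evaluate(self, n_rep: int = 100) -> Dict[Tuple[str, str, str], float]:
        """Re-run the full DGM x method x measure cross; rerunning this
        after each registration is what keeps the benchmark 'living'."""
        results: Dict[Tuple[str, str, str], float] = {}
        for dgm in self.dgms:
            for method in self.methods:
                estimates = [method.fn(dgm.fn()) for _ in range(n_rep)]
                for measure in self.measures:
                    results[(dgm.name, method.name, measure.name)] = measure.fn(estimates)
        return results

# Toy usage: one DGM (normal data, true mean 0.5), one method (sample mean),
# one performance measure (bias of the estimates across repetitions).
random.seed(1)
bench = LivingBenchmark()
bench.register("dgms", "normal", "1.0.0",
               lambda: [random.gauss(0.5, 1.0) for _ in range(50)])
bench.register("methods", "sample_mean", "1.0.0", statistics.mean)
bench.register("measures", "bias", "1.0.0",
               lambda ests: statistics.mean(ests) - 0.5)
results = bench.evaluate(n_rep=200)
```

The point of the sketch is the separation of roles: DGM authors, method authors, and measure authors each contribute through the same narrow interface, and neutrality comes from the benchmark operator, not the method developer, running the cross.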
Problem

Research questions and friction points this paper is trying to address.

Addressing biased incentives in statistical simulation studies
Enabling systematic comparisons across different simulation methodologies
Establishing neutral frameworks for cumulative method evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples method and simulation study development
Continuously updates benchmarks with new components
Enables neutral systematic method comparisons
František Bartoš
Department of Psychological Methods, University of Amsterdam
Samuel Pawel
Epidemiology, Biostatistics and Prevention Institute, University of Zurich
Statistics · Meta-Research
Björn S. Siepe
Psychological Methods Lab, Department of Psychology, University of Marburg