🤖 AI Summary
This study addresses the lack of standardized, reproducible evaluation benchmarks for surrogate modeling. To this end, we introduce a large-scale, fully open-source, automatically normalized, and cross-study reproducible surrogate model evaluation framework. It systematically assesses 29 surrogate models across 60 benchmark functions and 40 real-world datasets. Implemented in R, the framework integrates an automated simulation pipeline, input-adaptive scaling, and a standardized testing protocol; it is distributed publicly as the R package *duqling*, enabling fair one-click comparisons and fully reproducible results. Experimental results reveal systematic trade-offs among prediction accuracy, robustness, generalization capability, and computational efficiency across model families. These findings provide empirically grounded guidance for method developers and practitioners, along with best-practice recommendations for surrogate model selection and deployment.
📝 Abstract
Accurate and efficient surrogate modeling is essential for modern computational science, and there are a staggering number of emulation methods to choose from. With new methods being developed all the time, comparing the relative strengths and weaknesses of different methods remains a challenge due to inconsistent benchmarking practices and (sometimes) limited reproducibility and transparency. In this work, we present a large-scale, fully reproducible comparison of $29$ distinct emulators across $60$ canonical test functions and $40$ real emulation datasets. To facilitate rigorous, apples-to-apples comparisons, we introduce the R package *duqling*, which streamlines reproducible simulation studies using a consistent, simple syntax and automatic internal scaling of inputs. This framework allows researchers to compare emulators in a unified environment and makes it possible to replicate or extend previous studies with minimal effort, even across different publications. Our results provide detailed empirical insight into the strengths and weaknesses of state-of-the-art emulators and offer guidance both for method developers and for practitioners selecting a surrogate for new data. We discuss best practices for emulator comparison and highlight how *duqling* can accelerate research in emulator design and application.