🤖 AI Summary
Aerodynamic shape optimization methods lack a unified, scalable evaluation framework, which makes fair comparison across diverse tasks difficult. This work introduces ShapeBench, an open-source benchmark covering 103 tasks across eight shape categories. It provides a standardized API, a validated surrogate model per task for fast search, high-fidelity CFD verification where feasible, and a consistent computational budget metric. ShapeBench supports systematic cross-shape and cross-objective evaluation of optimization algorithms and includes ShapeEvolve, a domain-specialized evolutionary large language model baseline. Experiments show that optimizer rankings barely transfer across tasks (mean pairwise Spearman ρ = 0.013), so current methods are far from universal and more robust, general-purpose optimization strategies are needed.
📝 Abstract
Rapid progress in aerodynamic shape optimization (ASO) has outpaced the standardized evaluation frameworks currently available. Fair comparison requires a unified benchmark spanning diverse shape classes and objective formulations, together with state-of-the-art baselines run under matched budgets. We introduce ShapeBench, an open-source ASO benchmark with a unified API spanning 103 tasks across eight shape categories and multiple optimization regimes. Each ShapeBench task includes a validated surrogate for fast search; where feasible, a high-fidelity Computational Fluid Dynamics (CFD) pipeline is also available for final verification, enabling systematic fidelity-gap analysis. ShapeBench provides a reproducible protocol with well-configured baselines and a consistent budget metric, so that classical and LLM-driven methods can be compared fairly, including general-purpose optimizers and a new domain-specialized evolutionary LLM baseline, ShapeEvolve. Results on ShapeBench show substantial variance in optimizer rankings across shape categories and problem formulations, with mean pairwise Spearman $\rho = 0.013$, so single-task conclusions do not reliably generalize across problem classes. The benchmark is also far from saturated; classical methods are rarely applicable across all shape categories and tasks, further highlighting the need for more general-purpose approaches.
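To make the mean pairwise Spearman statistic concrete, the following is a minimal sketch of how such a number could be computed from per-task optimizer scores. It is not the ShapeBench implementation: the function names, the task labels, and the score values are all hypothetical, and the abstract does not specify whether pairs are formed over tasks, shape categories, or problem formulations. Here each pair of tasks is compared by ranking the same optimizers on both and correlating the ranks.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("Spearman rho is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def mean_pairwise_spearman(scores_by_task):
    """Average rho over all unordered task pairs.

    scores_by_task maps a task name to a list of scores, one per optimizer,
    with optimizers in the same order for every task.
    """
    pairs = combinations(sorted(scores_by_task), 2)
    return mean(spearman(scores_by_task[a], scores_by_task[b]) for a, b in pairs)


# Hypothetical scores for four optimizers on three made-up tasks.
example_scores = {
    "task_a": [1.0, 2.0, 3.0, 4.0],
    "task_b": [4.0, 3.0, 2.0, 1.0],
    "task_c": [1.0, 2.0, 3.0, 4.0],
}
example_rho = mean_pairwise_spearman(example_scores)
```

In this toy input two tasks agree perfectly on the ranking and one reverses it, so the three pairwise correlations average to a value below zero. A mean near zero, as reported above, indicates that knowing how optimizers rank on one task says almost nothing about how they rank on another.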