🤖 AI Summary
Large language models perform inconsistently at scientific ideation, showing pronounced "jaggedness" across tasks, domains, and prompts, which undermines their reliability in research settings. This work introduces SciAidanBench, a benchmark of open-ended scientific questions that measures scientific creativity by how many unique, coherent ideas a model can generate, and uses it to evaluate 19 base models from 8 providers (30 variants including reasoning versions). The study then treats jaggedness not as a limitation but as a resource, using inference-time compute, knowledge pooling, and multi-model brainstorming to build meta-model ensembles that combine complementary model strengths. The resulting ensembles outperform any single model, and the cross-task comparison shows that gains in general creativity do not translate uniformly to scientific creativity.
📝 Abstract
As artificial intelligence advances, models are not improving uniformly. Instead, progress unfolds in a jagged fashion, with capabilities growing unevenly across tasks, domains, and model scales. In this work, we examine this dynamic jaggedness through the lens of scientific idea generation. We introduce SciAidanBench, a benchmark of open-ended scientific questions designed to measure the scientific creativity of large language models (LLMs). Given a scientific question, models are asked to generate as many unique and coherent ideas as possible, with the total number of valid responses serving as a proxy for creative potential. Evaluating 19 base models across 8 providers (30 total variants including reasoning versions), we find that jaggedness manifests both across models and within models. First, in a cross-task comparison between general and scientific creativity, improvements in general creativity do not translate uniformly to scientific creativity, revealing divergent capability profiles across models. Second, at the prompt level, stronger models do not improve uniformly; instead, they exhibit high variability, with bursts of creativity on some questions and limited performance on others. Third, at the domain level, individual models display uneven strengths across scientific subfields, reflecting fragmented internal capability profiles. Finally, we show that this jaggedness can be harnessed. We explore mechanisms of inference-time compute, knowledge pooling, and brainstorming to combine models effectively and construct meta-model ensembles that outperform any single model. Our results position jaggedness not as a limitation, but as a resource, a structural feature of AI progress that, when understood and leveraged, can amplify LLM-driven scientific creativity.
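The scoring idea described above (count the unique, coherent ideas a model produces, then pool ideas across models) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the token-overlap (Jaccard) similarity used in place of whatever uniqueness measure the benchmark applies, the `max_similarity` threshold, and the `is_coherent` callback standing in for a coherence judge.

```python
from itertools import zip_longest
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a crude stand-in for a real similarity metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_valid(
    ideas: List[str],
    is_coherent: Callable[[str], bool],
    max_similarity: float = 0.6,  # hypothetical threshold
) -> List[str]:
    """Keep ideas that are coherent and not near-duplicates of earlier kept ideas."""
    accepted: List[str] = []
    for idea in ideas:
        if not is_coherent(idea):
            continue
        if any(jaccard(idea, prev) > max_similarity for prev in accepted):
            continue
        accepted.append(idea)
    return accepted


def pooled_ideas(
    ideas_by_model: Dict[str, List[str]],
    is_coherent: Callable[[str], bool],
    max_similarity: float = 0.6,
) -> List[str]:
    """Interleave ideas from several models, then apply the same validity filter."""
    interleaved = [
        idea
        for group in zip_longest(*ideas_by_model.values())
        for idea in group
        if idea is not None
    ]
    return filter_valid(interleaved, is_coherent, max_similarity)


# Toy data: each model contributes one idea the other lacks.
coherent = lambda s: len(s.split()) >= 3  # placeholder for a coherence judge
ideas = {
    "model_a": [
        "use crispr to edit drought genes",
        "use crispr to edit drought genes today",  # near-duplicate
        "breed deeper root systems",
    ],
    "model_b": [
        "breed deeper root systems",
        "apply soil microbiome inoculants",
        "",  # incoherent
    ],
}
score_a = len(filter_valid(ideas["model_a"], coherent))
score_b = len(filter_valid(ideas["model_b"], coherent))
score_pool = len(pooled_ideas(ideas, coherent))
print(score_a, score_b, score_pool)
```

In this toy case the pooled set contains more valid ideas than either model alone because the two models' strengths are complementary, which is the intuition behind treating jaggedness as a resource. Real ensembles would need a stronger similarity measure and coherence judge than the placeholders used here.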