🤖 AI Summary
Symbolic regression (SR) suffers from combinatorial explosion of the search space, severe overfitting, and poor model interpretability. This paper proposes IdeaSearchFitter, an LLM-augmented evolutionary framework in which large language models serve as semantic operators: they generate and mutate candidate expressions guided by natural-language rationales, replacing the purely syntactic search of traditional genetic programming. On the Feynman Symbolic Regression Database (FSReD), the method achieves competitive, noise-robust accuracy and outperforms several strong baselines; on real-world data it finds mechanistically aligned models with good accuracy-complexity trade-offs. In a high-energy physics application, it derives compact, physically motivated parametrizations of Parton Distribution Functions. The core contribution is the integration of LLMs' semantic reasoning into the evolutionary search loop, shifting the search from syntactic evolution toward concept-driven scientific discovery.
📝 Abstract
Symbolic regression (SR), the automated discovery of mathematical expressions from data, is a cornerstone of scientific inquiry. However, it is often hindered by the combinatorial explosion of the search space and a tendency to overfit. Popular methods, rooted in genetic programming, explore this space syntactically, often yielding overly complex, uninterpretable models. This paper introduces IdeaSearchFitter, a framework that employs Large Language Models (LLMs) as semantic operators within an evolutionary search. By generating candidate expressions guided by natural-language rationales, our method biases discovery towards models that are not only accurate but also conceptually coherent and interpretable. We demonstrate IdeaSearchFitter's efficacy across diverse challenges: it achieves competitive, noise-robust performance on the Feynman Symbolic Regression Database (FSReD), outperforming several strong baselines; discovers mechanistically aligned models with good accuracy-complexity trade-offs on real-world data; and derives compact, physically-motivated parametrizations for Parton Distribution Functions in a frontier high-energy physics application. IdeaSearchFitter is a specialized module within our broader iterated agent framework, IdeaSearch, which is publicly available at https://www.ideasearch.cn/.
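The idea of an LLM acting as a semantic operator inside an evolutionary loop can be illustrated with a minimal sketch. This is not IdeaSearchFitter's actual code or API: `stub_llm_propose`, `IDEA_BANK`, `evaluate`, and `search` are hypothetical names, the "LLM" is replaced by a fixed bank of (rationale, expression) pairs, and the fitness (mean squared error plus a token-count penalty) is an assumed stand-in for an accuracy-complexity trade-off.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    rationale: str    # natural-language justification for the functional form
    expression: str   # Python expression in the variable x
    score: float = math.inf


# Hypothetical stand-in for LLM output: each idea pairs a rationale with a formula.
IDEA_BANK = [
    ("The response may grow linearly with x.", "2.0 * x"),
    ("A quadratic term suggests energy-like scaling.", "x ** 2"),
    ("Saturation at large |x| suggests a bounded form.", "math.tanh(x)"),
    ("Energy-like scaling plus a constant offset.", "x ** 2 + 1.0"),
    ("Unbounded growth suggests an exponential.", "math.exp(x)"),
]


def stub_llm_propose(parents, generation, n=2):
    """Return n new candidates. A real semantic operator would place the
    parents and their rationales in a prompt; this stub ignores them and
    cycles through IDEA_BANK deterministically."""
    start = generation * n
    picks = [IDEA_BANK[(start + k) % len(IDEA_BANK)] for k in range(n)]
    return [Candidate(rationale, expr) for rationale, expr in picks]


def evaluate(expr, xs, ys, complexity_weight=0.01):
    """Assumed fitness: mean squared error plus a penalty per token."""
    mse = sum(
        (eval(expr, {"math": math, "x": x}) - y) ** 2 for x, y in zip(xs, ys)
    ) / len(xs)
    return mse + complexity_weight * len(expr.split())


def search(xs, ys, generations=5, population_size=3):
    population = []
    for generation in range(generations):
        children = stub_llm_propose(population, generation)
        for child in children:
            child.score = evaluate(child.expression, xs, ys)
        # Merge, drop duplicate expressions, keep the best few (lower is better).
        merged = {c.expression: c for c in population + children}
        population = sorted(merged.values(), key=lambda c: c.score)[:population_size]
    return population[0]


# Synthetic data generated from y = x^2 + 1.
xs = [i / 10 for i in range(-20, 21)]
ys = [x ** 2 + 1.0 for x in xs]
best = search(xs, ys)
print(best.expression, "|", best.rationale)
```

Keeping the rationale attached to each candidate is the point of the sketch: the selected expression carries a human-readable justification, which a purely syntactic mutation operator would not provide.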