Data Enrichment for Symbolic Regression Using Diffusion Models

📅 2026-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant performance degradation of symbolic regression under sparse, noisy, or physically incomplete spatiotemporal data, where conventional data augmentation often produces samples that violate physical laws. To overcome this limitation, the authors propose a physics-guided latent diffusion framework that, for the first time, embeds physical constraints directly into a diffusion model. The approach uses a variational autoencoder to extract low-dimensional representations and a conditional latent diffusion model to generate synthetic data, followed by a physics-informed residual corrector that enforces adherence to the governing equations. Requiring no additional domain knowledge, the method produces high-fidelity, physically consistent data and substantially improves the equation recovery accuracy of multiple symbolic regression algorithms (GPLearn, DEAP, and PySR) under sparse observational settings. The evaluation covers three physical systems: heat conduction, incompressible Navier-Stokes flow, and a moving single-mass Newtonian gravitational potential.
📝 Abstract
Symbolic regression (SR) offers a route to scientific discovery by converting observations into interpretable governing equations. Despite its promise, its reliability degrades sharply when spatiotemporal measurements are sparse, noisy, or physically incomplete, as commonly occurs in practice. Data enrichment (DE) can mitigate this limitation, yet additional samples can mislead equation discovery unless they preserve the physical structure of the target system. Implementing DE in this way requires narrow domain expertise as well as technical fluency, which severely limits its practical usefulness. In this study, we introduce a physics-guided latent diffusion framework for DE in support of downstream SR models. The proposed framework combines a variational autoencoder, a conditional latent diffusion model, and a physics-informed residual corrector to complete sparse observations with synthetic fields constrained by governing relations. We evaluate the approach on heat conduction, incompressible Navier-Stokes flow, and a moving single-mass Newtonian gravitational potential, using GPLearn, DEAP, and PySR as downstream SR backends. Our results reveal that physics-corrected enrichment consistently improves recovery in sparse regimes across physical dynamics and SR models. These results show that generative enrichment can strengthen equation discovery without additional domain expertise.
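The abstract does not spell out how the physics-informed residual corrector works, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the 1D heat equation u_t = alpha * u_xx, a finite-difference residual, and plain gradient descent on the squared residual with the sparse observations held fixed. The names `heat_residual`, `residual_loss`, and `correct_field` are hypothetical, and the "generated" field is simply an analytic solution plus noise standing in for a diffusion-model sample.

```python
import numpy as np

def heat_residual(u, dt, dx, alpha):
    """Finite-difference residual of u_t = alpha * u_xx at interior points.
    u has shape (T, X); the result has shape (T-1, X-2)."""
    u_t = (u[1:, 1:-1] - u[:-1, 1:-1]) / dt
    u_xx = (u[:-1, 2:] - 2.0 * u[:-1, 1:-1] + u[:-1, :-2]) / dx**2
    return u_t - alpha * u_xx

def residual_loss(u, dt, dx, alpha):
    """Half the sum of squared PDE residuals."""
    return 0.5 * np.sum(heat_residual(u, dt, dx, alpha) ** 2)

def correct_field(u_gen, obs_mask, obs_values, dt, dx, alpha, n_steps=200):
    """Hypothetical corrector: gradient descent on the residual loss,
    keeping entries where obs_mask is True fixed at obs_values."""
    u = np.where(obs_mask, obs_values, u_gen)
    c = alpha / dx**2
    # Step size from an upper bound on the residual operator's norm,
    # which keeps the descent on this quadratic loss stable.
    lr = 1.0 / (2.0 / dt + 4.0 * c) ** 2
    for _ in range(n_steps):
        r = heat_residual(u, dt, dx, alpha)
        g = np.zeros_like(u)
        # Hand-derived gradient: scatter each residual back onto the
        # four grid values that entered its stencil.
        g[1:, 1:-1] += r / dt
        g[:-1, 1:-1] += (-1.0 / dt + 2.0 * c) * r
        g[:-1, 2:] -= c * r
        g[:-1, :-2] -= c * r
        g[obs_mask] = 0.0  # never move observed measurements
        u = u - lr * g
    return u

# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
alpha = 0.1
x = np.linspace(0.0, 1.0, 32)
t = np.linspace(0.0, 0.1, 20)
dx, dt = x[1] - x[0], t[1] - t[0]
u_true = np.exp(-alpha * np.pi**2 * t)[:, None] * np.sin(np.pi * x)[None, :]
obs_mask = rng.random(u_true.shape) < 0.1                   # ~10% observed
u_gen = u_true + 0.05 * rng.standard_normal(u_true.shape)   # stand-in sample

u_start = np.where(obs_mask, u_true, u_gen)
u_corr = correct_field(u_gen, obs_mask, u_true, dt, dx, alpha)
loss_before = residual_loss(u_start, dt, dx, alpha)
loss_after = residual_loss(u_corr, dt, dx, alpha)
print(f"residual loss: {loss_before:.3e} -> {loss_after:.3e}")
```

In a pipeline like the one described, a corrected field such as `u_corr` would then be passed to a downstream SR backend. The sketch stops before that step to stay independent of any particular SR library.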
Problem

Research questions and friction points this paper is trying to address.

Symbolic Regression
Data Enrichment
Sparse Data
Physics-Informed Modeling
Equation Discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Physics-Guided Diffusion
Symbolic Regression
Data Enrichment
Latent Diffusion Model
Physics-Informed Correction