🤖 AI Summary
Discovering implicit equations $f(\mathbf{x}) = 0$ from unsupervised scientific data (e.g., particle trajectories, astronomical observations) remains challenging, as existing symbolic regression methods suffer from degraded performance due to the prevalence of degenerate solutions in the search space.
Method: We propose PIE, a pretrained neural-symbolic framework that reformulates implicit equation discovery as a “scientific relation translation” task, enabling end-to-end inference of non-degenerate mathematical structural skeletons. PIE integrates domain-specific scientific priors into synthetic pretraining data and jointly leverages large language models and neural-symbolic reasoning.
Contribution/Results: On diverse real-world unsupervised scientific datasets, PIE achieves, for the first time, high-accuracy and robust discovery of non-degenerate implicit equations—significantly outperforming state-of-the-art approaches. Crucially, it breaks the conventional symbolic regression paradigm that relies on labeled input-output pairs, enabling purely unsupervised, physics-informed equation learning.
📝 Abstract
Symbolic regression (SR) -- which learns symbolic equations to describe the underlying relation from input-output pairs -- is widely used for scientific discovery. However, a rich set of scientific data from the real world (e.g., particle trajectories and astrophysics) is typically unsupervised, devoid of explicit input-output pairs. In this paper, we focus on symbolic implicit equation discovery, which aims to discover the mathematical relation from unsupervised data that follows an implicit equation $f(\mathbf{x}) = 0$. However, due to the dense distribution of degenerate solutions (e.g., $f(\mathbf{x}) = x_i - x_i$) in the discrete search space, most existing SR approaches customized for this task fail to achieve satisfactory performance. To tackle this problem, we introduce a novel pre-training framework -- namely, Pre-trained neural symbolic model for Implicit Equation (PIE) -- to discover implicit equations from unsupervised data. The core idea is to formulate implicit equation discovery on unsupervised scientific data as a translation task and to utilize the prior learned from the pre-training dataset to infer non-degenerate skeletons of the underlying relation end-to-end. Extensive experiments show that, leveraging the prior from a pre-trained language model, PIE effectively tackles the problem of degenerate solutions and significantly outperforms all existing SR approaches. PIE marks an encouraging step towards general scientific discovery on unsupervised data.
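To make the degenerate-solution problem concrete, here is a minimal sketch (not from the paper; the data and candidate expressions are illustrative assumptions) showing why naive residual minimization cannot distinguish the true implicit relation from a degenerate one on unsupervised data:

```python
import numpy as np

# Unsupervised data: points sampled from the unit circle (no input-output labels).
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=100)
X = np.column_stack([np.cos(theta), np.sin(theta)])  # columns: x1, x2

# Candidate implicit equations f(x) = 0, scored by mean |f(x)| over the data.
candidates = {
    "x1^2 + x2^2 - 1": lambda x: x[:, 0] ** 2 + x[:, 1] ** 2 - 1,  # true relation
    "x1 - x1": lambda x: x[:, 0] - x[:, 0],  # degenerate: identically zero
}

for name, f in candidates.items():
    print(f"{name}: mean residual = {np.abs(f(X)).mean():.2e}")
```

Both candidates achieve a near-zero residual, so a search guided purely by the residual objective is flooded with such degenerate expressions; PIE sidesteps this by using a learned prior to propose non-degenerate skeletons directly.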