Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery

📅 2025-07-11

🤖 AI Summary

Scientific modeling faces a trade-off between interpretability and flexibility: mechanistic models are interpretable but break down under real-world complexity, while data-driven models require large labeled datasets, cannot infer unobserved quantities, and are hard to interpret. To address this, the paper proposes Simulation-Grounded Neural Networks (SGNNs), a general framework that pretrains neural networks on synthetic data generated by mechanistic simulations, so that the simulations themselves act as the supervisory signal. Because simulated data come with known latent states, SGNNs can be trained to estimate targets that are unobservable in real data. They also provide “back-to-simulation” attribution: for a real-world input, the model retrieves the simulations it regards as most similar, indicating which underlying dynamics it believes are active. Evaluated on COVID-19 forecasting and transmissibility estimation, chemical yield prediction, ecological forecasting, and source classification in simulated social networks, SGNNs reportedly nearly triple COVID-19 forecasting skill relative to CDC baselines (about 2.8×), reduce chemical yield prediction error by one third (about 33%), and estimate R₀ more accurately than traditional methods, even early in an outbreak.

📝 Abstract

Scientific modeling faces a core limitation: mechanistic models offer interpretability but collapse under real-world complexity, while machine learning models are flexible but require large labeled datasets, cannot infer unobservable quantities, and operate as black boxes. We introduce Simulation-Grounded Neural Networks (SGNNs), a general framework that uses mechanistic simulations as training data for neural networks. SGNNs are pretrained on synthetic corpora spanning diverse model structures, parameter regimes, stochasticity, and observational artifacts. We evaluated SGNNs across scientific disciplines and modeling tasks and found that they achieved state-of-the-art results across settings. For prediction tasks, they nearly tripled COVID-19 forecasting skill versus CDC baselines, reduced chemical yield prediction error by one third, and maintained accuracy in ecological forecasting where task-specific models failed. For inference tasks, SGNNs accurately classified the source of information spread in simulated social networks and enabled supervised learning for unobservable targets, such as estimating COVID-19 transmissibility more accurately than traditional methods even in early outbreaks. Finally, SGNNs enable back-to-simulation attribution, a new form of mechanistic interpretability. Given real-world input, SGNNs retrieve simulations based on what the model has learned to see as most similar, revealing which underlying dynamics the model believes are active. This provides process-level insight (what the model thinks is happening), not just which features mattered. SGNNs unify scientific theory with deep learning flexibility and unlock a new modeling paradigm, transforming simulations from rigid, post hoc tools into flexible sources of supervision and enabling robust, interpretable inference even when ground truth is missing.

Problem

Research questions and friction points this paper is trying to address.

Overcoming interpretability vs flexibility trade-off in scientific modeling

Reducing reliance on large labeled datasets for machine learning

Enabling accurate inference of unobservable quantities in real-world scenarios
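
The third point, supervised learning for a quantity that is never observed in real data, can be illustrated with a toy sketch. It is not the paper's implementation: the discrete-time SIR simulator, the parameter ranges, the 50% reporting rate, and every function name below are assumptions made for illustration, and a ridge regressor stands in for the neural network. The idea it shows is only that a simulator yields (observation, latent parameter) pairs, so R₀ can serve as a training label.

```python
import numpy as np

def simulate_sir(r0, rng, gamma=0.2, n_days=40, pop=1e5, i0=10, report_rate=0.5):
    """Toy discrete-time SIR model; returns noisy, under-reported daily cases."""
    s, i = pop - i0, float(i0)
    beta = r0 * gamma  # transmission rate implied by R0 and the recovery rate
    new_cases = []
    for _ in range(n_days):
        new_inf = beta * s * i / pop
        s -= new_inf
        i += new_inf - gamma * i
        new_cases.append(new_inf)
    # Observation model (assumed): Poisson noise around a fixed reporting fraction.
    return rng.poisson(report_rate * np.array(new_cases))

def make_corpus(n, rng):
    """Synthetic corpus: log case curves as inputs, the latent R0 as the label."""
    r0 = rng.uniform(1.2, 4.0, size=n)
    x = np.stack([np.log1p(simulate_sir(r, rng)) for r in r0])
    return x, r0

def fit_ridge(x, y, lam=1e-2):
    """Closed-form ridge regression with a bias column (stand-in for a network)."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    return np.linalg.solve(xb.T @ xb + lam * np.eye(xb.shape[1]), xb.T @ y)

def predict(w, x):
    return np.hstack([x, np.ones((len(x), 1))]) @ w

rng = np.random.default_rng(0)
x_train, y_train = make_corpus(2000, rng)
x_test, y_test = make_corpus(200, rng)

w = fit_ridge(x_train, y_train)
mae = np.mean(np.abs(predict(w, x_test) - y_test))
# Naive reference: always predict the mean R0 of the training corpus.
baseline_mae = np.mean(np.abs(y_train.mean() - y_test))
print(f"ridge MAE: {mae:.3f}, mean-predictor MAE: {baseline_mae:.3f}")
```

In this sketch the test curves come from the same simulator as the training curves, so it says nothing about transfer to real outbreaks; the abstract attributes that robustness to pretraining on corpora that span many model structures, parameter regimes, and observational artifacts.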

Innovation

Methods, ideas, or system contributions that make the work stand out.

SGNNs use mechanistic simulations as training data

SGNNs pretrain on diverse synthetic corpora

SGNNs enable back-to-simulation attribution for interpretability
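
Back-to-simulation attribution, as described in the abstract, retrieves the simulations the model regards as most similar to a real input. A minimal sketch of that retrieval step follows, under loud assumptions: the `embed` function is a fixed random projection with a tanh nonlinearity standing in for a trained encoder, the two "mechanisms" (exponential growth and decay) are toy placeholders, and cosine similarity with top-k lookup is one plausible retrieval rule, not necessarily the one the authors use.

```python
import numpy as np

def embed(x, w):
    """Hypothetical encoder: random projection + tanh, L2-normalised per row."""
    z = np.tanh(x @ w)
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def trace_back(query, corpus_emb, labels, w, k=5):
    """Return (mechanism label, cosine similarity) for the k nearest simulations."""
    q = embed(query[None, :], w)[0]
    sims = corpus_emb @ q  # cosine similarity, since all embeddings are unit-norm
    top = np.argsort(-sims)[:k]
    return [(labels[i], float(sims[i])) for i in top]

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 30)

def sample(mechanism):
    """Toy simulator: a noisy exponential growth or decay curve."""
    rate = rng.uniform(1.0, 3.0)
    sign = 1.0 if mechanism == "growth" else -1.0
    return np.exp(sign * rate * t) + rng.normal(0.0, 0.05, size=t.shape)

# Simulation corpus with a known generating mechanism for every entry.
labels = ["growth", "decay"] * 50
corpus = np.stack([sample(m) for m in labels])

w = rng.normal(0.0, 0.3, size=(30, 16))  # stand-in for learned encoder weights
corpus_emb = embed(corpus, w)

# An "observed" curve whose mechanism is treated as unknown at query time.
query = np.exp(2.0 * t) + rng.normal(0.0, 0.05, size=t.shape)
neighbors = trace_back(query, corpus_emb, labels, w, k=5)
for mechanism, sim in neighbors:
    print(mechanism, round(sim, 3))
```

The labels attached to the retrieved neighbors are what give the explanation its process-level character: they name the generating mechanism of each matched simulation rather than ranking input features.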
