Physics-Aware Auxiliary Losses Improve Out-of-Distribution Generalization of a GNN Synthesizability Filter

📅 2026-06-10

🤖 AI Summary

This work addresses the limited out-of-distribution (OOD) generalization of purely statistical molecular synthesizability filters, which degrade on the novel molecules proposed by generative models. The authors propose a multitask learning framework that adds two closed-form physical priors, the Bertz topological complexity index and MMFF94 force-field strain energy, as auxiliary supervision on a GINE backbone: the network jointly regresses topological complexity and is subject to a soft penalty on strain energy. Trained on drug-like HIV+Tox21 molecules and tested on COCONUT natural products over 5 seeds, the combined variant improves OOD AUC over the baseline (mean 0.9774) by a small but statistically significant +0.0066 (95% CI [+0.0038, +0.0093]), while all variants are indistinguishable in-distribution. The results indicate that physical priors give a modest OOD benefit and that multi-seed evaluation is needed for robust conclusions, since a single-seed run told a qualitatively different story.
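To make the multitask objective concrete, here is a minimal per-molecule sketch of how a classification loss might be combined with the two auxiliary terms. The summary does not give the exact loss forms or weights, so everything specific here is an assumption for illustration: the squared-error complexity term, the Huber form of the "soft" strain penalty, the function names, and the default weights `lambda_complexity` and `lambda_strain`.

```python
import math


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def huber(error, delta=1.0):
    """Huber loss: quadratic near zero, linear for large errors.

    ASSUMPTION: used here as one plausible reading of a "soft penalty";
    the source does not specify the functional form.
    """
    a = abs(error)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def multitask_loss(logit, label,
                   complexity_pred, bertz_target,
                   strain_pred, strain_target,
                   lambda_complexity=0.1, lambda_strain=0.1):
    """Synthesizability loss plus two physics-aware auxiliary terms.

    The lambda weights are hypothetical. Setting one of them to 0
    recovers a single-auxiliary arm of the ablation; setting both to 0
    recovers the baseline.
    """
    classification = bce_with_logits(logit, label)
    complexity = (complexity_pred - bertz_target) ** 2   # regression on Bertz index
    strain = huber(strain_pred - strain_target)          # soft penalty on MMFF94 strain
    return classification + lambda_complexity * complexity + lambda_strain * strain
```

Under these assumptions the four ablation arms (baseline, +complexity, +strain, +both) differ only in which weights are nonzero, so the backbone and the classification head stay identical across arms.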

📝 Abstract

Machine-learning drug-discovery pipelines increasingly rely on generative models that propose molecules far from the data used to train downstream synthesizability filters. Existing filters (SAScore, SCScore, RAscore, DeepSA) are purely statistical and degrade in exactly this out-of-distribution (OOD) regime. We ask whether cheap, closed-form physical priors, used as auxiliary supervision on a graph neural network (GNN), improve OOD generalization. We add two auxiliary losses to a GINE backbone: a topological complexity regression supervised by the Bertz index, and a strain-energy soft penalty supervised by MMFF94 force-field energy. On a 65,177-molecule corpus (HIV, Tox21, COCONUT) labeled by SAScore thresholds we reproduce a strong in-distribution baseline, then evaluate a 4-way ablation (baseline / +complexity / +strain / +both) on a single-source OOD split (train on drug-like HIV+Tox21, test on COCONUT natural products), repeated over 5 seeds with paired bootstrap confidence intervals. All three physics-aware variants give a small but statistically significant OOD improvement over the baseline (mean OOD AUC 0.9774): +complexity Delta = +0.0060 (95% CI [+0.0023, +0.0102]), +strain Delta = +0.0032 ([+0.0008, +0.0052]), +both Delta = +0.0066 ([+0.0038, +0.0093]); every interval excludes zero, and the combination is best. The variants are indistinguishable in-distribution, so the effect is visible only under OOD evaluation. We are explicit that the effects are modest, and we report a cautionary methodological finding: a single-seed version of this experiment produced a qualitatively different (non-monotone) story that did not survive multi-seed evaluation.
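The paired bootstrap confidence intervals mentioned in the abstract can be illustrated with a short, standard-library-only sketch. It assumes that resampling is done over test molecules and that a percentile interval is used; the abstract does not say whether the authors resample molecules, seeds, or both, so this is one plausible reading and not their exact procedure. The function names and the defaults `n_boot`, `alpha`, and `seed` are hypothetical.

```python
import random


def roc_auc(labels, scores):
    """AUC via the Mann-Whitney rank-sum statistic, with average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def paired_bootstrap_delta_auc(labels, base_scores, variant_scores,
                               n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile CI for AUC(variant) - AUC(baseline).

    "Paired" means both models are scored on the SAME resampled molecules
    in every bootstrap replicate, so shared test-set noise cancels.
    """
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that contain only one class
        base = [base_scores[i] for i in idx]
        variant = [variant_scores[i] for i in idx]
        deltas.append(roc_auc(y, variant) - roc_auc(y, base))
    deltas.sort()
    lo_idx = max(0, int(round((alpha / 2.0) * n_boot)))
    hi_idx = min(n_boot - 1, int(round((1.0 - alpha / 2.0) * n_boot)) - 1)
    point = roc_auc(labels, variant_scores) - roc_auc(labels, base_scores)
    return point, deltas[lo_idx], deltas[hi_idx]
```

An interval that excludes zero, as reported for all three physics-aware variants, is the criterion the abstract uses for calling an improvement statistically significant. Repeating the whole procedure across training seeds guards against the single-seed artifact the authors caution about.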

Problem

Research questions and friction points this paper is trying to address.

out-of-distribution generalization

synthesizability filter

molecular generation

graph neural network

physics-aware priors

Innovation

Methods, ideas, or system contributions that make the work stand out.

physics-aware losses

out-of-distribution generalization

graph neural network