Towards Data-Efficient Pretraining for Atomic Property Prediction

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional pretraining for atomic property prediction relies on ever-larger datasets and compute budgets, overlooking the role of task relevance relative to data volume. Method: The authors propose the Chemical Similarity Index (CSI), a Fréchet-style metric for molecular graphs that quantifies the alignment between upstream pretraining datasets and downstream tasks. Guided by CSI, they select the pretraining dataset with minimal distance to the target task, yielding a compact, task-relevant pretraining set. Contribution/Results: Models pretrained on the selected data match or surpass large-scale mixed baselines such as JMP on atomic property prediction benchmarks while using as little as 1/24th of the computational cost. Crucially, they show that indiscriminately adding poorly aligned data can degrade performance, supporting a "small but precise" paradigm of task-aware data curation for molecular pretraining.

📝 Abstract
This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected, task-relevant dataset can match or even surpass large-scale pretraining, while using as little as 1/24th of the computational cost. We introduce the Chemical Similarity Index (CSI), a novel metric for molecular graphs inspired by computer vision's Fréchet Inception Distance, which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the pretraining dataset with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently outperform those pretrained on massive, mixed datasets such as JMP, even when those larger datasets include the relevant dataset. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data poorly aligns with the task at hand. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.
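The abstract describes CSI as a Fréchet Inception Distance-style metric over molecular embeddings. The paper's exact formulation and embedding model are not given in this summary, but the general FID recipe can be sketched: fit a Gaussian to each dataset's embeddings and compute the Fréchet distance between the two Gaussians, then rank candidate pretraining sets by their distance to the downstream task. A minimal sketch, assuming generic per-molecule feature vectors (the function name `frechet_distance` and the toy data are illustrative, not from the paper):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fit to two embedding sets.

    x, y: (n_samples, dim) arrays of per-molecule embeddings.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Matrix square root of the covariance product; discard the tiny
    # imaginary components that numerical error can introduce.
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

# Toy ranking: pick the candidate pretraining set closest to the task.
rng = np.random.default_rng(0)
task = rng.normal(0.0, 1.0, size=(500, 8))        # downstream embeddings
candidates = {
    "aligned": rng.normal(0.0, 1.0, size=(500, 8)),
    "shifted": rng.normal(3.0, 1.0, size=(500, 8)),
}
scores = {name: frechet_distance(emb, task) for name, emb in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # the aligned set scores lowest
```

This mirrors the paper's selection logic at a high level: the "small but precise" subset is whichever candidate minimizes the distance to the downstream distribution, rather than the largest available mixture.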
Problem

Research questions and friction points this paper is trying to address.

How to make pretraining for atomic property prediction data- and compute-efficient.
How to cut computational cost by selecting task-relevant pretraining data.
How to quantify the alignment between a pretraining dataset and a downstream task (addressed via CSI).
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chemical Similarity Index (CSI)
Task-relevant dataset selection
Reduced computational cost