LaMM: Semi-Supervised Pre-Training of Large-Scale Materials Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Neural network potentials (NNPs) face three key bottlenecks: high pretraining costs on large-scale materials datasets, heavy reliance on expensive DFT-labeled data, and severe load imbalance in distributed training. To address these challenges, this work proposes LaMM, a semi-supervised pretraining framework for materials modeling. Methodologically, LaMM introduces (1) an enhanced denoising self-supervised learning paradigm that effectively leverages ~300 million partially labeled atomic structures to reduce dependency on DFT annotations, and (2) a multi-node dynamic load-balancing algorithm that significantly mitigates computational resource imbalance during distributed training. Experimental results demonstrate that LaMM-pretrained models achieve higher downstream fine-tuning accuracy, faster convergence, and over 40% improvement in pretraining efficiency compared to supervised baselines. By decoupling pretraining from exhaustive DFT labeling and enabling scalable, balanced distributed optimization, LaMM establishes a novel foundational model paradigm for cost-effective, large-scale materials simulation.
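
To make the denoising objective concrete, here is a minimal sketch, assuming a PyTorch-style NNP backbone that maps perturbed coordinates to per-atom vectors. The names `model`, `positions`, and `species` are placeholders, and the exact noise schedule and loss in LaMM may differ; the point is only that this term needs no DFT labels:

```python
import torch

def denoising_loss(model, positions, species, sigma=0.1):
    # Perturb atomic coordinates with Gaussian noise.
    noise = sigma * torch.randn_like(positions)       # (n_atoms, 3)
    noisy_positions = positions + noise
    # Train the network to recover the per-atom displacement;
    # no DFT energy/force labels are required for this term.
    predicted = model(noisy_positions, species)       # assumed (n_atoms, 3) output
    return torch.mean((predicted - noise) ** 2)
```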

📝 Abstract
Neural network potentials (NNPs) are crucial for accelerating computational materials science by surrogating density functional theory (DFT) calculations. Improving their accuracy is possible through pre-training and fine-tuning, where an NNP model is first pre-trained on a large-scale dataset and then fine-tuned on a smaller target dataset. However, this approach is computationally expensive, mainly due to the cost of DFT-based dataset labeling and load imbalances during large-scale pre-training. To address this, we propose LaMM, a semi-supervised pre-training method incorporating improved denoising self-supervised learning and a load-balancing algorithm for efficient multi-node training. We demonstrate that our approach effectively leverages a large-scale dataset of ~300 million semi-labeled samples to train a single NNP model, resulting in improved fine-tuning performance in terms of both speed and accuracy.
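
One plausible way to read "semi-supervised" here is a combined objective: a denoising term applied to every structure plus a supervised DFT term applied only where labels exist. The sketch below is illustrative under those assumptions, not the paper's exact formulation; `model.energy` and the batch keys are hypothetical:

```python
import torch

def semi_supervised_loss(model, batch, sigma=0.1, lam=1.0):
    positions, species = batch["positions"], batch["species"]

    # Self-supervised denoising term: usable on all ~300M structures.
    noise = sigma * torch.randn_like(positions)
    denoise_pred = model(positions + noise, species)
    loss = torch.mean((denoise_pred - noise) ** 2)

    # Supervised term: only for the subset carrying DFT energy labels.
    if batch.get("energy") is not None:
        energy_pred = model.energy(positions, species)  # hypothetical energy head
        loss = loss + lam * torch.mean((energy_pred - batch["energy"]) ** 2)
    return loss
```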
Problem

Research questions and friction points this paper is trying to address.

Improving accuracy of neural network potentials (NNPs) for materials science
Reducing computational costs of DFT-based dataset labeling
Addressing load imbalances during large-scale pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-supervised pre-training for materials models
Denoising self-supervised learning enhancement
Load-balancing algorithm for multi-node training (an illustrative heuristic is sketched below)
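
As a rough illustration of the load-balancing idea (the paper's dynamic multi-node algorithm is not reproduced here), a classic longest-processing-time heuristic assigns variable-size structures to the least-loaded worker, using atom count as a stand-in for compute cost:

```python
import heapq

def balance_structures(atom_counts, num_workers):
    # Min-heap of (current_load, worker_id): pop the least-loaded worker.
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]
    # Place the largest structures first (LPT heuristic).
    for idx in sorted(range(len(atom_counts)),
                      key=atom_counts.__getitem__, reverse=True):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(idx)
        heapq.heappush(heap, (load + atom_counts[idx], worker))
    return assignment

# Example: 6 structures of varying size across 2 workers.
print(balance_structures([120, 30, 64, 8, 200, 50], 2))
# -> [[4, 1, 3], [0, 2, 5]] with near-equal total loads (238 vs. 234)
```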

👥 Authors
Yosuke Oyama, Fujitsu Limited (high performance computing, deep learning)
Yusuke Majima, Fujitsu Limited, Kawasaki, Kanagawa 211-8588, Japan
Eiji Ohta, Fujitsu Limited, Kawasaki, Kanagawa 211-8588, Japan
Yasufumi Sakai, Fujitsu Limited, Kawasaki, Kanagawa 211-8588, Japan