🤖 AI Summary
Training billion-parameter universal machine-learned interatomic potentials (uMLIPs) is hindered by two problems: no efficient parallel framework supports the second-order derivatives that uMLIP training requires, and model scaling creates computation and communication bottlenecks. This work proposes MatRIS-MoE, a mixture-of-experts architecture, together with Janus, a high-dimensional distributed training framework, enabling for the first time efficient exascale parallel training with second-order derivative support. With hardware-aware communication optimizations, the system achieves 1.2 and 1.0 EFLOPS of single-precision performance (24% and 35.5% of theoretical peak) on two exascale supercomputers, respectively, with parallel efficiency exceeding 90%. This reduces training time from weeks to hours, substantially accelerating the development of AI-for-Science foundation models.
📝 Abstract
Universal Machine Learning Interatomic Potentials (uMLIPs), pre-trained on massively diverse datasets encompassing inorganic materials and organic molecules across the entire periodic table, serve as foundation models for quantum-accurate physical simulations. However, uMLIP training requires second-order derivatives, which existing parallel training frameworks do not support; moreover, scaling to the billion-parameter regime causes explosive growth in computation and communication overhead, making training a tremendous challenge. We introduce MatRIS-MoE, a billion-parameter Mixture-of-Experts model built upon an invariant architecture, and Janus, a pioneering high-dimensional distributed training framework for uMLIPs with hardware-aware optimizations. Deployed across two Exascale supercomputers, our code attains a peak performance of 1.2/1.0 EFLOPS (24%/35.5% of theoretical peak) in single precision at over 90% parallel efficiency, compressing the training of billion-parameter uMLIPs from weeks to hours. This work establishes a new high-water mark for AI-for-Science (AI4S) foundation models at Exascale and provides essential infrastructure for rapid scientific discovery.
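The abstract does not spell out why uMLIP training requires second-order derivatives. One common reason, assumed here, is that forces are obtained as the negative gradient of the predicted energy with respect to atomic positions, so a loss on forces must be differentiated a second time, with respect to the model parameters. The sketch below illustrates this on a toy one-dimensional harmonic bond with a single trainable stiffness `k`. All names (`energy`, `forces`, `force_loss`, `force_loss_grad_k`, `central_diff`, `R0`) are hypothetical and hand-derived for illustration; this is not the MatRIS-MoE model or the Janus API.

```python
# Toy 1D harmonic bond: E(x1, x2; k) = 0.5 * k * (|x2 - x1| - R0)^2.
# Hypothetical illustration only; not the MatRIS-MoE model or Janus code.

R0 = 1.0  # assumed equilibrium bond length


def energy(x1, x2, k):
    """Potential energy of the bond for stiffness parameter k."""
    return 0.5 * k * (abs(x2 - x1) - R0) ** 2


def forces(x1, x2, k):
    """First derivative w.r.t. positions: F_i = -dE/dx_i."""
    sign = 1.0 if x2 >= x1 else -1.0
    stretch = abs(x2 - x1) - R0
    return (k * stretch * sign, -k * stretch * sign)


def force_loss(k, x1, x2, f_ref):
    """Squared error between predicted and reference forces."""
    f1, f2 = forces(x1, x2, k)
    return (f1 - f_ref[0]) ** 2 + (f2 - f_ref[1]) ** 2


def force_loss_grad_k(k, x1, x2, f_ref):
    """Gradient of the force loss w.r.t. the parameter k.

    dL/dk = sum_i 2 * (F_i - F_ref_i) * dF_i/dk, where
    dF_i/dk = -d^2 E / (dk dx_i) is a mixed second derivative of the energy.
    """
    sign = 1.0 if x2 >= x1 else -1.0
    stretch = abs(x2 - x1) - R0
    df_dk = (stretch * sign, -stretch * sign)
    f1, f2 = forces(x1, x2, k)
    return 2.0 * (f1 - f_ref[0]) * df_dk[0] + 2.0 * (f2 - f_ref[1]) * df_dk[1]


def central_diff(fn, x, h=1e-4):
    """Central finite difference, used only to sanity-check the derivatives."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


# Reference forces generated with a "true" stiffness of 3.0.
f_ref = forces(0.0, 1.5, 3.0)
# Gradient of the force loss at the current guess k = 2.0.
grad = force_loss_grad_k(2.0, 0.0, 1.5, f_ref)
# Numerical check of the same quantity.
grad_fd = central_diff(lambda k: force_loss(k, 0.0, 1.5, f_ref), 2.0)
```

In an automatic-differentiation framework the same structure appears as a gradient taken through another gradient, which is the pattern a distributed training framework for such models has to support across devices.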