Why Physics Still Matters: Improving Machine Learning Prediction of Material Properties with Phonon-Informed Datasets

📅 2025-11-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Machine learning models for materials suffer from limited predictive accuracy because training datasets underrepresent low-symmetry atomic configurations, such as those induced by thermal excitation, defects, and chemical disorder. To address this, we propose a physics-guided data generation strategy grounded in phonon spectra: lattice vibrational modes guide the sampling of atomic configurations used to train graph neural networks (GNNs), yielding a compact yet physically rich dataset. Compared with a larger dataset of randomly generated configurations, this approach improves prediction accuracy for the electronic and mechanical properties of optoelectronic materials at finite temperatures, and explainability analyses show that the better-performing models assign greater weight to chemically meaningful bonds. The core contribution is a “physics-driven small-data” paradigm: targeted, physically informed datasets can outperform larger unstructured ones. The framework offers a simple and general data curation strategy for material property prediction in fields such as energy conversion and photonics.

📝 Abstract
Machine learning (ML) methods have become powerful tools for predicting material properties with near first-principles accuracy and vastly reduced computational cost. However, the performance of ML models critically depends on the quality, size, and diversity of the training dataset. In materials science, this dependence is particularly important for learning from low-symmetry atomistic configurations that capture thermal excitations, structural defects, and chemical disorder, features that are ubiquitous in real materials but underrepresented in most datasets. The absence of systematic strategies for generating representative training data may therefore limit the predictive power of ML models in technologically critical fields such as energy conversion and photonics. In this work, we assess the effectiveness of graph neural network (GNN) models trained on two fundamentally different types of datasets: one composed of randomly generated atomic configurations and another constructed using physically informed sampling based on lattice vibrations. As a case study, we address the challenging task of predicting electronic and mechanical properties of a prototypical family of optoelectronic materials under realistic finite-temperature conditions. We find that the phonon-informed model consistently outperforms the randomly trained counterpart, despite relying on fewer data points. Explainability analyses further reveal that high-performing models assign greater weight to chemically meaningful bonds that control property variations, underscoring the importance of physically guided data generation. Overall, this work demonstrates that larger datasets do not necessarily yield better GNN predictive models and introduces a simple and general strategy for efficiently constructing high-quality training data in materials informatics.
Problem

Research questions and friction points this paper is trying to address.

Improving ML prediction of material properties using phonon-informed datasets
Addressing limitations in training data quality for low-symmetry atomistic configurations
Developing efficient strategies for constructing high-quality training data in materials informatics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses phonon-informed sampling for dataset creation
Applies graph neural networks to material property prediction
Prioritizes chemically meaningful bonds through explainability analysis
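
The contrast between phonon-informed and random sampling can be illustrated with a minimal sketch. This is not the authors' procedure, which this summary does not describe. It assumes a toy periodic 1D harmonic chain, classical equipartition for the mode amplitudes (variance k_B T / ω² per mass-weighted normal coordinate), and hypothetical function names and parameter values chosen only for illustration.

```python
import numpy as np

KB_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def chain_force_constants(n_atoms, k_spring):
    """Force-constant matrix (eV/A^2) of a periodic 1D harmonic chain (toy model)."""
    phi = np.zeros((n_atoms, n_atoms))
    for i in range(n_atoms):
        phi[i, i] += 2.0 * k_spring
        phi[i, (i + 1) % n_atoms] -= k_spring
        phi[i, (i - 1) % n_atoms] -= k_spring
    return phi


def phonon_informed_displacements(phi, mass, temperature, n_samples, rng):
    """Sample displacements by thermally populating the normal modes of phi.

    Assumes identical atomic masses and classical statistics.
    """
    dyn = phi / mass                       # mass-weighted dynamical matrix
    omega2, modes = np.linalg.eigh(dyn)    # squared frequencies, eigenvectors
    keep = omega2 > 1e-8 * omega2.max()    # drop the rigid-translation mode
    # Classical equipartition: each normal coordinate is Gaussian with
    # variance k_B T / omega^2.
    sigma = np.sqrt(KB_EV_PER_K * temperature / omega2[keep])
    q = rng.normal(size=(n_samples, sigma.size)) * sigma
    # Project mode amplitudes back to real-space displacements.
    return (q @ modes[:, keep].T) / np.sqrt(mass)


def random_displacements(n_atoms, max_disp, n_samples, rng):
    """Baseline: independent uniform displacements in [-max_disp, max_disp]."""
    return rng.uniform(-max_disp, max_disp, size=(n_samples, n_atoms))


def harmonic_energy(phi, disp):
    """Harmonic potential energy 0.5 * u^T phi u for each sampled configuration."""
    return 0.5 * np.einsum("si,ij,sj->s", disp, phi, disp)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi = chain_force_constants(n_atoms=8, k_spring=2.0)   # illustrative values
    thermal = phonon_informed_displacements(phi, 30.0, 300.0, 1000, rng)
    uniform = random_displacements(8, 0.15, 1000, rng)
    print("mean energy, phonon-informed (eV):", harmonic_energy(phi, thermal).mean())
    print("mean energy, random (eV):", harmonic_energy(phi, uniform).mean())
```

In this toy setting the phonon-informed samples concentrate on the displacement patterns the lattice explores at the chosen temperature, with soft modes receiving larger amplitudes, while the uniform baseline spreads samples over directions regardless of their energetic cost. A real workflow would take the force constants from a first-principles phonon calculation rather than from a spring model.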
Pol Benítez
Department of Physics, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain
Cibrán López
Department of Physics, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain
Edgardo Saucedo
Department of Electronic Engineering, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain
Teruyasu Mizoguchi
Institute of Industrial Science, The University of Tokyo
DFT simulation, Materials Informatics, EELS, XAFS, Materials Design
Claudio Cazorla
Department of Physics, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain