Data Fusion of Deep Learned Molecular Embeddings for Property Prediction

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the degradation of multi-task learning (MTL) performance in molecular property prediction caused by data sparsity, this paper proposes an embedding-fusion MTL framework. Departing from conventional parameter-sharing paradigms, it innovatively fuses task-specific molecular embeddings—learned independently by GCN- or Transformer-based encoders—at the embedding level. The fused representations are concatenated and dimensionally reduced before undergoing joint regression training. This design alleviates MTL’s reliance on strong inter-task correlations and complete data coverage, thereby substantially improving prediction robustness for weakly correlated and infrequently observed properties. On quantum chemistry benchmark datasets and a custom-built sparse dataset, the method achieves an average 18% reduction in mean absolute error (MAE) over standard MTL baselines, and improves prediction accuracy for data-constrained properties by 23% relative to single-task models.
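The fusion step described above (concatenate task-specific embeddings, then reduce dimension before joint regression) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding sizes, the number of encoders, and the random linear projection are all assumptions for demonstration (the paper learns the reduction jointly with the regression heads).

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_embeddings(embeddings):
    """Concatenate per-task molecular embeddings along the feature axis."""
    return np.concatenate(embeddings, axis=1)

def reduce_dim(fused, out_dim, rng):
    """Illustrative linear projection to a shared lower-dimensional space.
    In practice this would be a learned layer trained with the regressors."""
    W = rng.normal(size=(fused.shape[1], out_dim)) / np.sqrt(fused.shape[1])
    return fused @ W

# Hypothetical embeddings from three single-task encoders for 5 molecules
embs = [rng.normal(size=(5, d)) for d in (64, 64, 128)]
fused = fuse_embeddings(embs)         # shape (5, 256)
reduced = reduce_dim(fused, 32, rng)  # shape (5, 32)
```

The fused representation is then fed to a multi-task regression head, which is where the joint training happens.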

📝 Abstract
Data-driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. However, in many problems data is sparse, severely limiting model accuracy and applicability. To improve predictions, techniques such as transfer learning and multi-task learning have been used. The performance of multi-task learning models depends on the strength of the underlying correlations between tasks and the completeness of the dataset. We find that standard multi-task models tend to underperform when trained on sparse datasets with weakly correlated properties. To address this gap, we use data fusion techniques to combine the learned molecular embeddings of various single-task models and train a multi-task model on this combined embedding. We apply this technique to a widely used benchmark dataset of quantum chemistry data for small molecules as well as a newly compiled sparse dataset of experimental data collected from the literature and our own quantum chemistry and thermochemical calculations. The results show that the fused multi-task models outperform standard multi-task models on sparse datasets and can provide enhanced prediction on data-limited properties compared to single-task models.
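Joint training on a sparse dataset requires handling molecules for which only some property labels are observed. A common approach (sketched here as an assumption, not a detail confirmed by the abstract) is a masked loss that averages the error over observed entries only:

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error computed only over observed (molecule, property)
    entries. mask[i, j] = 1 where property j is measured for molecule i,
    0 where the label is missing, so missing labels contribute no gradient."""
    diff = (pred - target) * mask
    return (diff ** 2).sum() / mask.sum()

pred = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.array([[1.0, 0.0], [2.0, 4.0]])
mask = np.array([[1.0, 0.0], [1.0, 1.0]])  # second label of molecule 0 missing
loss = masked_mse(pred, target, mask)      # (0^2 + 1^2 + 0^2) / 3
```

This lets a single multi-task model train on an incomplete label matrix without imputing missing property values.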
Problem

Research questions and friction points this paper is trying to address.

Improving property prediction with sparse data
Enhancing multi-task learning for weakly correlated properties
Combining single-task embeddings for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuse deep learned molecular embeddings
Combine single-task model embeddings
Enhance sparse dataset predictions
Robert J. Appleton
School of Materials Engineering and Birck Nanotechnology Center, Purdue University, West Lafayette, Indiana 47907, USA
Brian C. Barnes
U.S. Army Combat Capabilities Development Command Army Research Laboratory, Aberdeen Proving Ground, Maryland 21005, USA
Alejandro Strachan
Reilly Professor of Materials Engineering, Purdue University
Predictive simulations of materials · Multiscale modeling · Theoretical materials science