Transformers for molecular property prediction: Domain adaptation efficiently improves performance

📅 2025-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study identifies a critical bottleneck in Transformer-based chemical language models for ADME property prediction (aqueous solubility, membrane permeability, microsomal stability, plasma protein binding): performance saturates beyond ~400K pretraining molecules, and pretraining scale shows no strong correlation with downstream task performance. To address this, the authors propose a lightweight domain adaptation strategy, multi-task regression of physicochemical properties, that requires only hundreds to thousands of target-domain molecules. Experiments demonstrate statistically significant improvements (p < 0.001) on three of the four ADME endpoints; a model pre-trained on just 400K molecules plus minimal domain-specific data performs on par with MolBERT (1.3M pretraining molecules) and MolFormer (100M), and comparably to a random forest baseline built on basic physicochemical properties. These results suggest that small-scale domain adaptation can substitute for large-scale pretraining, pointing toward more efficient, lower-cost molecular property modeling.

📝 Abstract
Most of the current transformer-based chemical language models are pre-trained on millions to billions of molecules. However, the improvement from such scaling in dataset size is not confidently linked to improved molecular property prediction. The aim of this study is to investigate and overcome some of the limitations of transformer models in predicting molecular properties. Specifically, we examine the impact of pre-training dataset size and diversity on the performance of transformer models and investigate the use of domain adaptation as a technique for improving model performance. First, our findings indicate that increasing pre-training dataset size beyond 400K molecules from the GuacaMol dataset does not result in a significant improvement on four ADME endpoints, namely, solubility, permeability, microsomal stability, and plasma protein binding. Second, our results demonstrate that using domain adaptation by further training the transformer model on a small set of domain-relevant molecules, i.e., a few hundred to a few thousand, using multi-task regression of physicochemical properties was sufficient to significantly improve performance for three out of the four investigated ADME endpoints (P-value < 0.001). Finally, we observe that a model pre-trained on 400K molecules and domain-adapted on a few hundred/thousand molecules performs similarly (P-value > 0.05) to more complicated transformer models like MolBERT (pre-trained on 1.3M molecules) and MolFormer (pre-trained on 100M molecules). A comparison to a random forest model trained on basic physicochemical properties showed similar performance to the examined transformer models. We believe that current transformer models can be improved through further systematic analysis of pre-training and downstream data, pre-training objectives, and scaling laws, ultimately leading to better and more helpful models.
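The domain adaptation described above amounts to continuing to train a pretrained encoder on a few hundred domain-relevant molecules with a multi-task regression head over physicochemical properties. The following is a minimal sketch of that idea; the linear stand-in encoder, the random toy data, and all dimensions are illustrative assumptions, not the authors' actual transformer architecture or training setup.

```python
# Sketch of domain adaptation via multi-task regression: a "pretrained"
# encoder is further trained, jointly with a multi-task head, on a small
# set of domain-relevant molecules. Everything here is a toy stand-in.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained transformer encoder: a fixed linear map from
# a molecular input representation to an embedding.
D_IN, D_EMB, N_TASKS = 32, 16, 3      # 3 toy physicochemical targets
W_enc = rng.normal(scale=0.1, size=(D_IN, D_EMB))  # "pretrained" weights

# A few hundred domain-relevant molecules (random stand-ins), with
# multi-task targets such as logP, MW, TPSA would be in practice.
X = rng.normal(size=(400, D_IN))
Y = X @ rng.normal(size=(D_IN, N_TASKS))

# Multi-task regression head, trained jointly with the encoder.
W_head = np.zeros((D_EMB, N_TASKS))

lr = 0.02
for step in range(1000):
    H = X @ W_enc                     # encoder forward pass
    pred = H @ W_head                 # predictions for all tasks at once
    err = pred - Y
    loss = (err ** 2).mean()          # multi-task MSE
    # Gradient directions for head and encoder (up to a constant factor).
    g_head = H.T @ err / len(X)
    g_enc = X.T @ (err @ W_head.T) / len(X)
    W_head -= lr * g_head
    W_enc -= lr * g_enc               # encoder is updated too: adaptation
```

The key point the sketch mirrors is that the encoder weights are not frozen: the small domain-specific dataset reshapes the pretrained representation itself, which is what distinguishes domain adaptation from simply fitting a new head.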
Problem

Research questions and friction points this paper is trying to address.

Investigates limitations of transformer models in molecular property prediction.
Examines impact of pre-training dataset size and diversity on model performance.
Explores domain adaptation to improve transformer model performance on ADME endpoints.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain adaptation enhances transformer model performance.
Pretraining beyond 400K molecules shows limited improvement.
Small domain-relevant datasets significantly boost ADME predictions.