🤖 AI Summary
General-purpose molecular representations lack solvent-specific physical grounding, while hazardous solvent usage constitutes a major climate risk in chemical engineering. Green solvent substitution research faces dual challenges: inadequate solvent representation and scarce labeled data. Method: We propose SoDaDE, a lightweight Transformer-based embedding method trained on small-scale solvent property data, marking the first application of compact Transformers to solvent representation learning. It generates continuous, physics-interpretable fingerprint vectors via self-supervised learning, enabling fine-grained, domain-adapted representation even under data scarcity. Contribution/Results: SoDaDE overcomes the physical semantic deficiency inherent in conventional molecular fingerprints. Evaluated on a recently published solvent dataset, it outperforms traditional representations on downstream yield prediction, demonstrating that data-driven chemical representations can be learnt effectively in low-data regimes.
📝 Abstract
Computational representations have become crucial in unlocking the recent growth of machine learning algorithms for chemistry. Representations were initially hand-designed, but machine learning has shown that meaningful representations can be learnt from data. Because chemical datasets are limited, learnt representations tend to be generic: they are trained on broad datasets that contain shallow information on many different molecule types. For example, generic fingerprints lack physical context specific to solvents. Yet the use of harmful solvents is a leading climate-related issue in the chemical industry, and there is a surge of interest in green solvent replacement. To empower this research, we propose a new solvent representation scheme, Solvent Data Driven Embeddings (SoDaDE). SoDaDE uses a small transformer model and a solvent property dataset to create a fingerprint for solvents. To showcase its effectiveness, we use SoDaDE to predict yields on a recently published dataset, outperforming previous representations. We demonstrate through this paper that data-driven fingerprints can be made with small datasets, and we set up a workflow that can be explored for other applications.