SoDaDE: Solvent Data-Driven Embeddings with Small Transformer Models

📅 2025-09-26

🤖 AI Summary

General-purpose molecular representations lack solvent-specific physical grounding, and the use of harmful solvents is a leading climate-related issue in the chemical industry. Research on green solvent replacement therefore faces two challenges: inadequate solvent representations and scarce labeled data. Method: We propose SoDaDE, a lightweight Transformer-based embedding method trained on a small solvent property dataset, in what is presented as the first application of compact Transformers to solvent representation learning. It generates continuous, physics-interpretable fingerprint vectors via self-supervised learning, enabling domain-adapted representations even under data scarcity. Contribution/Results: SoDaDE addresses the lack of solvent-specific physical context in conventional molecular fingerprints. Evaluated on a recently published solvent dataset, it outperforms previous representations on reaction yield prediction, indicating that data-driven fingerprints are feasible in low-data regimes.

📝 Abstract

Computational representations have become crucial in unlocking the recent growth of machine learning algorithms for chemistry. Although representations were initially hand-designed, machine learning has shown that meaningful representations can be learnt from data. Because chemical datasets are limited, representations learnt from data tend to be generic: they are trained on broad datasets that contain shallow information on many different molecule types. For example, generic fingerprints lack physical context specific to solvents. Yet the use of harmful solvents is a leading climate-related issue in the chemical industry, and there is a surge of interest in green solvent replacement. To empower this research, we propose a new solvent representation scheme by developing Solvent Data Driven Embeddings (SoDaDE). SoDaDE uses a small transformer model and a solvent property dataset to create a fingerprint for solvents. To showcase the effectiveness of these embeddings, we use SoDaDE to predict yields on a recently published dataset, outperforming previous representations. We demonstrate through this paper that data-driven fingerprints can be made with small datasets, and we set up a workflow that can be explored for other applications.

Problem

Research questions and friction points this paper is trying to address.

Developing solvent-specific embeddings for green chemistry applications

Creating data-driven fingerprints using small transformer models

Addressing limitations of generic molecular representations for solvents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer model creates solvent-specific embeddings

Data-driven fingerprints trained on solvent property dataset

Small-dataset workflow enables specialized chemical representations
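
To make the idea of training on a solvent property dataset more concrete, here is a minimal sketch of how such a table could be turned into masked training sequences for a small transformer. It assumes a masked-property objective, which is only one possible reading of the "self-supervised learning" mentioned in the summary; the source does not describe the actual tokenization or training objective. The solvent names, property names, normalized values, and all function names below are hypothetical placeholders, not data or code from the paper.

```python
import random

# Hypothetical property table. The values are illustrative placeholders
# (imagined as normalized to [0, 1]), not data from the paper.
SOLVENTS = {
    "solvent_a": {"boiling_point": 0.35, "density": 0.80, "dielectric": 0.10},
    "solvent_b": {"boiling_point": 0.62, "density": 0.45, "dielectric": 0.90},
}

MASK = "[MASK]"


def to_sequence(props):
    """Flatten {property: value} into alternating name/value tokens."""
    seq = []
    for name in sorted(props):  # fixed order keeps sequences comparable
        seq.append(name)
        seq.append(props[name])
    return seq


def mask_one_value(seq, rng):
    """Hide one value token; return (masked sequence, position, target)."""
    value_positions = list(range(1, len(seq), 2))  # values sit at odd indices
    pos = rng.choice(value_positions)
    masked = list(seq)
    target = masked[pos]
    masked[pos] = MASK
    return masked, pos, target


def make_examples(table, rng):
    """Build one masked training example per solvent (assumed objective)."""
    examples = []
    for solvent, props in table.items():
        masked, pos, target = mask_one_value(to_sequence(props), rng)
        examples.append(
            {"solvent": solvent, "input": masked, "position": pos, "target": target}
        )
    return examples


examples = make_examples(SOLVENTS, random.Random(0))
for ex in examples:
    print(ex["solvent"], ex["input"], "->", ex["target"])
```

In a setup like this, a model trained to recover the hidden value would have to encode how properties relate to one another, and an intermediate hidden state could then serve as the solvent fingerprint. Whether SoDaDE works this way is not stated in the text above.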
