The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning

📅 2025-06-09
🤖 AI Summary
Solvent selection—a critical yet challenging task in chemistry—is hindered by theoretical modeling difficulties and severe data scarcity, especially for continuous-flow processes. Method: We introduce the first temporal solvent selection benchmark dataset tailored for flow chemistry, encompassing 1,200+ continuous process conditions with high-resolution transient flow-control parameters and corresponding yield labels. To address sparse, sequential process spaces, we propose a temporal regression framework integrating domain-informed feature engineering, transfer learning, and active learning. Contribution/Results: Our method significantly improves prediction accuracy for solvent substitution under low-data regimes, reducing mean absolute error by 32% on average. The dataset fills a key gap in AI for Chemistry—namely, benchmarks for time-series-driven, few-shot solvent replacement—and empirically validates multiple AI strategies for sustainable chemical manufacturing. This work advances reproducible, scalable AI benchmarking in chemistry.

📝 Abstract
Machine learning has promised to change the landscape of laboratory chemistry, with impressive results in molecular property prediction and reaction retro-synthesis. However, chemical datasets are often inaccessible to the machine learning community because they require cleaning, demand a thorough understanding of the chemistry, or are simply not available. In this paper, we introduce a novel dataset for yield prediction, providing the first-ever transient flow dataset for machine learning benchmarking, covering over 1200 process conditions. While previous datasets focus on discrete parameters, our experimental set-up allows us to sample a large number of continuous process conditions, generating new challenges for machine learning models. We focus on solvent selection, a task that is particularly difficult to model theoretically and therefore ripe for machine learning applications. We showcase benchmarking for regression algorithms, transfer-learning approaches, feature engineering, and active learning, with important applications towards solvent replacement and sustainable manufacturing.
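The solvent-replacement framing in the abstract suggests scoring a yield model on solvents it has not seen during training. The block below is a minimal sketch of that idea, assuming a leave-one-solvent-out split scored by mean absolute error. The split, the baseline predictor, the field layout, and the toy records are all assumptions made for illustration; they are not the paper's protocol or data.

```python
from statistics import mean

# Hypothetical toy records: (solvent, residence_time_min, temperature_C, yield_fraction).
# These values are invented for illustration and are NOT taken from the benchmark.
RECORDS = [
    ("methanol", 2.0, 175.0, 0.10),
    ("methanol", 10.0, 225.0, 0.40),
    ("ethanol", 2.0, 175.0, 0.20),
    ("ethanol", 10.0, 225.0, 0.50),
    ("acetonitrile", 2.0, 175.0, 0.30),
    ("acetonitrile", 10.0, 225.0, 0.60),
]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted yields."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def predict_condition_mean(train, residence_time, temperature):
    """Baseline (an assumption, not the paper's model): average the training
    yields measured at identical process conditions, falling back to the
    global training mean when no condition matches."""
    matches = [y for _, rt, temp, y in train
               if rt == residence_time and temp == temperature]
    pool = matches or [y for *_, y in train]
    return mean(pool)


def leave_one_solvent_out(records):
    """Hold out each solvent in turn, fit on the rest, and report MAE per solvent."""
    scores = {}
    for held_out in sorted({r[0] for r in records}):
        train = [r for r in records if r[0] != held_out]
        test = [r for r in records if r[0] == held_out]
        preds = [predict_condition_mean(train, rt, temp) for _, rt, temp, _ in test]
        scores[held_out] = mean_absolute_error([r[3] for r in test], preds)
    return scores


scores = leave_one_solvent_out(RECORDS)
for solvent, mae in scores.items():
    print(f"{solvent}: MAE = {mae:.3f}")
```

Splitting by solvent rather than by individual measurement keeps every data point of the held-out solvent out of training, which is closer to the question of whether a model can propose a replacement solvent it has never observed.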
Problem

Research questions and friction points this paper is trying to address.

Introducing a novel dataset for yield prediction in chemistry
Focusing on solvent selection challenges for machine learning
Benchmarking regression and active learning for sustainable manufacturing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel transient flow dataset for ML
Continuous process conditions sampling
Benchmarking diverse ML approaches
Toby Boyne
Department of Computing, Imperial College, London, UK
Juan S. Campos
Department of Computing, Imperial College, London, UK
Becky D. Langdon
Department of Computing, Imperial College, London, UK
Jixiang Qing
Department of Computing, Imperial College, London, UK
Yilin Xie
Department of Computing, Imperial College, London, UK
Shiqiang Zhang
Department of Computing, Imperial College, London, UK
Calvin Tsay
Imperial College London, Department of Computing
Optimization, Machine Learning, Process Systems Engineering, Process Control
Ruth Misener
Imperial College London
Computational Optimization, MINLP, Bayesian Optimization, Open-source Software
Daniel W. Davies
Department of Chemistry, Imperial College, London, UK
K. Jelfs
Department of Chemistry, Imperial College, London, UK
Sarah L Boyall
SOLVE Chemistry, London, UK
Thomas M. Dixon
SOLVE Chemistry, London, UK
Linden Schrecker
SOLVE Chemistry, London, UK
Jose Pablo Folch
SOLVE Chemistry
Machine Learning, Optimization, Artificial Intelligence