The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning

📅 2025-06-09

📈 Citations: 0

✨ Influential: 0

career value

168K/year

🤖 AI Summary

Solvent selection—a critical yet challenging task in chemistry—is hindered by theoretical modeling difficulties and severe data scarcity, especially for continuous-flow processes. Method: We introduce the first temporal solvent selection benchmark dataset tailored for flow chemistry, encompassing 1,200+ continuous process conditions with high-resolution transient flow-control parameters and corresponding yield labels. To address sparse, sequential process spaces, we propose a temporal regression framework integrating domain-informed feature engineering, transfer learning, and active learning. Contribution/Results: Our method significantly improves prediction accuracy for solvent substitution under low-data regimes, reducing mean absolute error by 32% on average. The dataset fills a key gap in AI for Chemistry—namely, benchmarks for time-series-driven, few-shot solvent replacement—and empirically validates multiple AI strategies for sustainable chemical manufacturing. This work advances reproducible, scalable AI benchmarking in chemistry.

Technology Category

Application Category

📝 Abstract

Machine learning has promised to change the landscape of laboratory chemistry, with impressive results in molecular property prediction and reaction retro-synthesis. However, chemical datasets are often inaccessible to the machine learning community as they tend to require cleaning, thorough understanding of the chemistry, or are simply not available. In this paper, we introduce a novel dataset for yield prediction, providing the first-ever transient flow dataset for machine learning benchmarking, covering over 1200 process conditions. While previous datasets focus on discrete parameters, our experimental set-up allow us to sample a large number of continuous process conditions, generating new challenges for machine learning models. We focus on solvent selection, a task that is particularly difficult to model theoretically and therefore ripe for machine learning applications. We showcase benchmarking for regression algorithms, transfer-learning approaches, feature engineering, and active learning, with important applications towards solvent replacement and sustainable manufacturing.

Problem

Research questions and friction points this paper is trying to address.

Introducing a novel dataset for yield prediction in chemistry

Focusing on solvent selection challenges for machine learning

Benchmarking regression and active learning for sustainable manufacturing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel transient flow dataset for ML

Continuous process conditions sampling

Benchmarking diverse ML approaches

🔎 Similar Papers

A Strong Baseline for Molecular Few-Shot Learning