SC3: The Multi-Solvent Solubility Challenge and Benchmark

📅 2026-06-03

🤖 AI Summary

Multi-solvent solubility prediction models are not yet reliable enough for deployment, and existing benchmarks are compromised by inconsistent data curation, count-weighted evaluation metrics that obscure errors on tail solvents, and the use of a worst-case inter-laboratory figure (0.6-0.8 log S) as the noise ceiling. To address these issues, this work introduces SC3, a benchmark derived from BigSolDB v2.1 that features a reproducible data curation pipeline, three leakage-checked splits, and nested Gold/Silver/Bronze consensus tiers. The study also establishes a tighter experimental noise floor of 0.106 log S and proposes two metrics, PS-RMSE and Z-RMSE, for more rigorous assessment. A systematic evaluation of 31 baseline models shows that even the best-performing model has a Bronze-set PS-RMSE about five times the noise floor, which highlights a substantial performance gap. Follow-on analyses further indicate that calibrated per-point uncertainty is useful for diagnosis beyond point prediction.

📝 Abstract

Solubility prediction is a standard benchmark in computational chemistry, yet multi-solvent models that reportedly approach the experimental-noise ceiling (i.e., the aleatoric limit) are not yet reliable enough to be deployed. We argue that this gap is partly artefactual: published benchmarks differ in curation policies, evaluate on count-weighted RMSE that hides failure on tail-heavy solvent distributions, and treat the widely cited 0.6-0.8 log S inter-laboratory figure as the aleatoric ceiling even though it reflects worst-case, not expected, disagreement. We introduce SC3, a multi-solvent solubility benchmark built on BigSolDB v2.1 with three contributions: (i) a reproducible curation pipeline yielding 101,535 measurements over 1,327 solutes and 206 solvents, with a recalibrated aleatoric floor of 0.106 log S, roughly 6 times tighter than the conventional figure; (ii) nested Gold/Silver/Bronze consensus tiers with per-point standard deviation, three leakage-checked splits, and a multi-solvent metric suite (PS-RMSE, Z-RMSE); and (iii) a 31-model benchmark across six families, whose best Bronze PS-RMSE sits at 5 times the aleatoric limit, a gap that no deep alternative tested closes. We perform three follow-on analyses (data scaling, transfer from quantum-chemistry solvation energies, and feature-level attribution), which demonstrate that calibrated per-point uncertainty is reusable infrastructure for diagnosis beyond point prediction.
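
The abstract's point about count-weighted RMSE hiding tail failures can be illustrated with a small sketch. The abstract does not define PS-RMSE, so the code below assumes one plausible reading: an unweighted mean of per-solvent RMSEs. The function names, the toy data, and the solvent labels are all hypothetical and are not taken from SC3; only the 0.106 log S floor comes from the abstract.

```python
import math
from collections import defaultdict

# Aleatoric floor reported in the abstract (log S units).
NOISE_FLOOR = 0.106


def pooled_rmse(y_true, y_pred):
    """Count-weighted RMSE: every measurement contributes equally."""
    sq_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


def per_solvent_rmse(y_true, y_pred, solvents):
    """Unweighted mean of per-solvent RMSEs.

    ASSUMPTION: this is one plausible reading of "PS-RMSE";
    the paper's exact definition may differ.
    """
    groups = defaultdict(list)
    for t, p, s in zip(y_true, y_pred, solvents):
        groups[s].append((t - p) ** 2)
    rmses = [math.sqrt(sum(v) / len(v)) for v in groups.values()]
    return sum(rmses) / len(rmses)


# Hypothetical toy data: 98 points in a common solvent with small errors,
# 2 points in a rare solvent with large errors.
solvents = ["common_solvent"] * 98 + ["rare_solvent"] * 2
y_true = [0.0] * 100
y_pred = [0.1] * 98 + [2.0] * 2

pooled = pooled_rmse(y_true, y_pred)                # about 0.30
macro = per_solvent_rmse(y_true, y_pred, solvents)  # about 1.05

print(f"pooled RMSE      = {pooled:.3f} ({pooled / NOISE_FLOOR:.1f}x floor)")
print(f"per-solvent RMSE = {macro:.3f} ({macro / NOISE_FLOOR:.1f}x floor)")
```

In this toy case the pooled score looks moderate because the common solvent dominates the count, while the per-solvent average exposes the failure on the rare solvent. Expressing either score as a multiple of the 0.106 log S floor mirrors how the abstract reports the best Bronze PS-RMSE as 5 times the aleatoric limit.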

Problem

Research questions and friction points this paper is trying to address.

solubility prediction

multi-solvent

benchmark

aleatoric uncertainty

computational chemistry

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-solvent solubility

aleatoric uncertainty

benchmark curation