SupraBench: A Benchmark for Supramolecular Chemistry

📅 2026-06-11

🤖 AI Summary

This work addresses the lack of systematic evaluation of large language models (LLMs) on supramolecular chemistry host–guest reasoning tasks, such as binding affinity prediction. To this end, we introduce SupraBench, the first domain-specific benchmark comprising five tasks: binding affinity prediction, optimal host selection, solvent identification, host–guest description, and visual molecular recognition. We also release SupraPMC, a domain corpus of 16 million tokens, to facilitate model adaptation. Expert-designed multitask evaluations reveal substantial performance gaps and failure modes across current LLMs. Experiments show that domain pretraining with SupraPMC improves in-distribution regression performance but may degrade the accuracy of structured output formatting, underscoring the need for specialized model development in this scientific domain.

📝 Abstract

Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no existing benchmark systematically evaluates LLMs for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs on chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles distilled from Europe PMC, to support adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that they leave substantial headroom across all tasks. Domain-adaptive pretraining on SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source code and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.
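
The trade-off between regression quality and strict letter-format output can be illustrated with a minimal scoring sketch. This is not the SupraBench evaluation code: the abstract does not specify the metrics, so mean absolute error for the affinity task, exact-match letter accuracy for the selection task, the four-option choice set, and the function names `parse_letter`, `letter_accuracy`, and `mean_absolute_error` are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence


def parse_letter(response: str, choices: str = "ABCD") -> Optional[str]:
    """Return the answer letter only if the response is exactly one valid letter.

    Assumption: "strict letter format" means the whole (whitespace-trimmed)
    response must be a single uppercase option letter.
    """
    text = response.strip()
    if re.fullmatch(f"[{re.escape(choices)}]", text):
        return text
    return None  # verbose or malformed answers are counted as wrong


def letter_accuracy(responses: Sequence[str], gold: Sequence[str],
                    choices: str = "ABCD") -> float:
    """Fraction of responses whose strictly parsed letter equals the gold letter."""
    hits = sum(parse_letter(r, choices) == g for r, g in zip(responses, gold))
    return hits / len(gold)


def mean_absolute_error(predicted: Sequence[float],
                        target: Sequence[float]) -> float:
    """Average absolute difference between predicted and reference affinities."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Hypothetical model outputs, not benchmark data.
responses = ["B", "The answer is B", " c ", "A"]
gold = ["B", "B", "C", "A"]
print(letter_accuracy(responses, gold))

predicted = [4.0, 5.5, 3.0]
target = [4.5, 5.0, 3.0]
print(mean_absolute_error(predicted, target))
```

Under strict parsing, a response such as "The answer is B" is scored as incorrect even though it names the right option. This is one way a model can lose accuracy on a letter-format task through formatting alone, without any change in its regression error.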

Problem

Research questions and friction points this paper is trying to address.

supramolecular chemistry

large language models

benchmark

host-guest systems

binding affinity prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

SupraBench

supramolecular chemistry

large language models