🤖 AI Summary
Existing chemical reasoning benchmarks suffer from oversimplified tasks, inadequate process-level evaluation, and misalignment with expert-level capabilities. To address these limitations, the authors introduce SUPERChem, a benchmark of 500 challenging, expert-crafted problems spanning multiple chemistry subfields, provided in both multimodal and text-only formats. SUPERChem introduces Reasoning Path Fidelity (RPF), a scoring metric that assesses reasoning quality by comparing model-generated solution paths against expert-authored solution paths. Original content and an iterative curation pipeline are used to eliminate flawed items and mitigate data contamination. Because each problem is available with and without visual input, SUPERChem supports human-versus-model comparison and analysis of how the visual modality affects chemical reasoning. Human experts achieve a baseline accuracy of 40.3%, while the strongest evaluated model, GPT-5 (High), scores only 38.5%, indicating the benchmark's difficulty and discriminative power. SUPERChem aims to enable systematic, quantitative assessment of expert-level chemical reasoning processes.
📝 Abstract
Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, a lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated, reasoning-intensive chemistry problems covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The benchmark dataset is available at https://huggingface.co/datasets/ZehuaZhao/SUPERChem.
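The abstract does not state how RPF is computed, so the following is only an illustrative sketch of the general idea of scoring a model's reasoning against an expert-authored solution path. It assumes, purely for illustration, that an expert path can be broken into checkpoints and that fidelity is the fraction of checkpoints a model solution covers. The names `Checkpoint`, `checkpoint_hit`, and `path_fidelity`, as well as the keyword-matching rule, are hypothetical and are not taken from SUPERChem.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One step of an expert solution path (hypothetical structure)."""
    description: str
    keywords: tuple[str, ...]


def checkpoint_hit(checkpoint: Checkpoint, model_solution: str) -> bool:
    """Crude stand-in for a judge: all keywords must appear in the model text."""
    text = model_solution.lower()
    return all(keyword.lower() in text for keyword in checkpoint.keywords)


def path_fidelity(checkpoints: list[Checkpoint], model_solution: str) -> float:
    """Fraction of expert checkpoints covered by the model solution (assumed metric)."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    hits = sum(checkpoint_hit(c, model_solution) for c in checkpoints)
    return hits / len(checkpoints)


# Toy example: the model covers two of the three expert steps.
expert_path = [
    Checkpoint("identify the limiting reagent", ("limiting reagent",)),
    Checkpoint("convert masses to moles", ("moles",)),
    Checkpoint("apply the stoichiometric ratio", ("stoichiometric", "ratio")),
]
model_solution = (
    "Convert each mass to moles, then find the limiting reagent "
    "and report its amount."
)
score = path_fidelity(expert_path, model_solution)
print(f"path fidelity: {score:.2f}")
```

A score of this kind is separate from final-answer accuracy: a model can reach the correct answer while skipping expert steps, or follow the expert path and still slip at the end, which is the distinction between high-fidelity and heuristic reasoners that the abstract draws.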