🤖 AI Summary
Most existing chemistry benchmarks score only final answers, which makes logical errors in intermediate reasoning steps hard to detect. To address this limitation, this work proposes ChemCoTBench-V2, a benchmark that uses expert-designed structured templates to make models expose verifiable intermediate reasoning states. By combining deterministic chemical rules, reference-trace alignment for closed-answer tasks, and oracle-verifiable state constraints for open-ended tasks, the framework enables low-cost, auditable process-level evaluation without human or LLM-based adjudication. Experiments reveal a significant gap between answer correctness and reasoning consistency across mainstream large language models. The benchmark also supports fine-grained model comparison and pinpoints the first erroneous step in a reasoning trajectory.
📝 Abstract
Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and step-level human annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic benchmark for low-cost, auditable evaluation of structured, verifier-addressable chemical reasoning traces. It spans molecular understanding, molecule editing, molecular optimization, and reaction prediction, with 5,620 evaluation samples across 18 reporting tasks. Models must expose key intermediate steps in expert-designed templates. Those steps are checked with deterministic chemistry rules and, for closed-answer tasks, against reference traces rather than by another LLM judge. Open-ended molecular optimization is evaluated with oracle-verifiable state constraints rather than strict trace matching. The benchmark reports three separate signals: final-answer correctness, template adherence, and step-wise verifier correctness over expert-refined intermediate commitments. Experiments on frontier models reveal a persistent gap between final-answer success and structured-reasoning-state consistency: models often follow the requested format while failing chemical-step checks, or answer correctly with weak supporting reasoning. ChemCoTBench-V2 enables fine-grained model comparison and identifies the concrete step at which the trace first violates the verifier.
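The three reported signals and the first-violation localization can be illustrated with a toy sketch. This is not the benchmark's implementation: the template fields, `evaluate`, `Report`, and the benzene example are all hypothetical, chosen only for illustration. The single chemistry rule used is the standard degrees-of-unsaturation formula, DoU = (2C + 2 + N - H - X) / 2.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical template: fields a structured trace is assumed to expose.
REQUIRED_STEPS = ["formula_counts", "degrees_of_unsaturation", "final_answer"]


def dou_rule(trace: dict) -> bool:
    """Deterministic rule check: DoU = (2C + 2 + N - H - X) / 2."""
    if "formula_counts" not in trace or "degrees_of_unsaturation" not in trace:
        return False  # a missing step counts as a failed step
    c = trace["formula_counts"]
    expected = (2 * c.get("C", 0) + 2 + c.get("N", 0)
                - c.get("H", 0) - c.get("X", 0)) / 2
    return trace["degrees_of_unsaturation"] == expected


@dataclass
class Report:
    final_correct: bool               # signal 1: final-answer correctness
    template_ok: bool                 # signal 2: template adherence
    step_accuracy: float              # signal 3: step-wise verifier correctness
    first_failed_step: Optional[str]  # first step that violates a check


def evaluate(trace: dict, reference: dict) -> Report:
    final_correct = trace.get("final_answer") == reference["final_answer"]
    template_ok = all(step in trace for step in REQUIRED_STEPS)

    # Ordered intermediate checks: one reference comparison, one rule check.
    results = [
        ("formula_counts",
         trace.get("formula_counts") == reference["formula_counts"]),
        ("degrees_of_unsaturation", dou_rule(trace)),
    ]
    passed = sum(ok for _, ok in results)
    first_failed = next((name for name, ok in results if not ok), None)
    return Report(final_correct, template_ok, passed / len(results), first_failed)


# Toy trace: correct final answer, well-formed template, wrong intermediate step.
trace = {
    "formula_counts": {"C": 6, "H": 6},
    "degrees_of_unsaturation": 3,  # benzene (C6H6) should give 4
    "final_answer": "benzene",
}
reference = {"formula_counts": {"C": 6, "H": 6}, "final_answer": "benzene"}
report = evaluate(trace, reference)
print(report)
```

In this toy case the trace answers correctly and follows the template, yet fails one of its two step checks, which is the kind of gap between final-answer success and reasoning-state consistency the abstract describes. Keeping the three signals as separate fields, rather than folding them into one score, is what makes that gap visible.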