🤖 AI Summary
Problem: Supervised learning for single-step retrosynthesis is severely hindered by the scarcity of labeled chemical data.
Method: We propose an atom-anchored chain-of-thought (CoT) reasoning framework that enables *unsupervised* application of general-purpose large language models (LLMs) to retrosynthesis. The approach uses unique atom identifiers for position-aware molecular encoding, so that chemical knowledge is tied explicitly to specific atoms during reasoning; it uses one-shot prompting to identify reaction sites and candidate transformation types, optionally followed by few-shot prompting to predict the transformation; and it yields a way to construct theory-grounded synthetic datasets.
Contribution/Results: The framework eliminates reliance on annotated data and establishes an interpretable structure–knowledge–reasoning mapping. On academic benchmarks and expert-validated drug discovery molecules, it achieves success rates of ≥90% in identifying chemically plausible reaction sites, ≥40% in named-reaction classification, and ≥74% in predicting final reactants, substantially improving LLMs’ reliability in complex chemical reasoning.
📝 Abstract
Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring labeled training data. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a one-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites ($\geq 90\%$), named reaction classes ($\geq 40\%$), and final reactants ($\geq 74\%$). Beyond solving complex chemical tasks, our work also provides a method to generate theoretically grounded synthetic datasets by mapping chemical knowledge onto the molecular structure, thereby addressing data scarcity.
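
To make the idea of atom-anchored reasoning concrete, the following is a minimal sketch, not the authors' implementation. It assumes the product is supplied as an atom-mapped SMILES string (e.g. `[CH3:1][C:2](=[O:3])[NH:4][CH3:5]`), that the first-step LLM output has already been parsed into dictionaries with `atoms` and `reaction` keys, and that the prompt wording is purely illustrative. The names `atom_ids`, `build_site_prompt`, and `validate_sites` are hypothetical, and the regular expression is a simplification rather than a full SMILES parser.

```python
import re

# Simplified pattern for bracket atoms that carry an atom-map number,
# e.g. [CH3:1] or [NH:4]. Assumption for illustration only: this is not
# a complete SMILES parser.
ATOM_MAP = re.compile(r"\[\d*([A-Za-z][a-z]?)[^\]:]*:(\d+)\]")


def atom_ids(mapped_smiles: str) -> dict[int, str]:
    """Return {atom identifier: element symbol} for every mapped atom."""
    return {int(num): symbol for symbol, num in ATOM_MAP.findall(mapped_smiles)}


def build_site_prompt(mapped_smiles: str, example: str) -> str:
    """Assemble a one-shot prompt (hypothetical wording) for the first step."""
    return (
        "Each atom in the product carries a unique identifier.\n"
        f"Worked example:\n{example}\n"
        f"Product: {mapped_smiles}\n"
        "List candidate disconnections as 'atom ids -> reaction class'."
    )


def validate_sites(mapped_smiles: str, predictions: list[dict]) -> list[dict]:
    """Keep only predictions whose atom identifiers all exist in the molecule."""
    known = atom_ids(mapped_smiles)
    return [
        p for p in predictions
        if p["atoms"] and all(a in known for a in p["atoms"])
    ]


# Usage: N-methylacetamide with atom-map numbers 1..5.
product = "[CH3:1][C:2](=[O:3])[NH:4][CH3:5]"
prompt = build_site_prompt(product, "<one worked example goes here>")
parsed = [
    {"atoms": [2, 4], "reaction": "amide coupling"},  # C-N amide bond
    {"atoms": [2, 9], "reaction": "unknown"},         # atom 9 does not exist
]
valid = validate_sites(product, parsed)
```

Because every predicted site refers to explicit atom identifiers, answers that mention atoms absent from the molecule can be rejected mechanically before any second, few-shot step is attempted.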