Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration

📅 2025-10-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Supervised learning for single-step retrosynthesis is severely hindered by the scarcity of labeled chemical data. Method: We propose an atom-anchored chain-of-thought (CoT) reasoning framework that enables *unsupervised* application of general-purpose large language models (LLMs) to retrosynthesis. Our approach employs atom identifiers for position-aware molecular encoding, explicitly embedding chemical knowledge into the reasoning process; constructs a theory-grounded synthetic dataset; and leverages few-shot and zero-shot prompting to jointly identify reaction sites and transformation types. Contribution/Results: The framework eliminates reliance on annotated data and establishes an interpretable, structure–knowledge–reasoning mapping. On multiple benchmarks and real drug-like molecules, it achieves ≥90% accuracy in reaction site identification, ≥40% accuracy in named-reaction classification, and ≥74% success rate in precursor prediction—substantially enhancing LLMs’ reliability and generalization capability in complex chemical reasoning.

📝 Abstract
Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring labeled training data. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a one-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites (≥90%), named reaction classes (≥40%), and final reactants (≥74%). Beyond solving complex chemical tasks, our work also provides a method to generate theoretically grounded synthetic datasets by mapping chemical knowledge onto the molecular structure and thereby addressing data scarcity.
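As a minimal illustration of the "unique atomic identifiers" idea (a sketch, not the paper's actual pipeline), atom-mapped SMILES attach a stable integer to every atom, e.g. `[CH3:1]`, giving a reasoning model a positional handle it can cite when naming a reaction site. The helper below, with a hypothetical aspirin-like mapped SMILES, simply extracts those identifiers:

```python
import re

def atom_map_numbers(mapped_smiles):
    """Extract the atom-map identifiers from an atom-mapped SMILES string.

    Each mapped atom is written as [symbol:N]; the integer N serves as a
    unique, position-aware anchor for that atom in downstream reasoning.
    """
    return [int(m) for m in re.findall(r":(\d+)\]", mapped_smiles)]

# Aspirin with illustrative atom-map numbers (assumed mapping, for demo only):
mapped = ("[CH3:1][C:2](=[O:3])[O:4][c:5]1[cH:6][cH:7]"
          "[cH:8][cH:9][c:10]1[C:11](=[O:12])[OH:13]")
print(atom_map_numbers(mapped))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
```

A prompt can then refer to "the bond between atoms 2 and 4" (the ester linkage here) instead of describing substructures ambiguously, which is what makes the reaction-site identification step checkable.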
Problem

Research questions and friction points this paper is trying to address.

Addressing chemical data scarcity with unlabeled molecular reasoning
Enhancing LLM performance in retrosynthesis prediction tasks
Generating synthetic datasets by anchoring chemical knowledge to atoms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses atomic identifiers for molecular reasoning
Applies one-shot and few-shot learning techniques
Generates synthetic datasets to overcome data scarcity
Alan Kai Hassen
Machine Learning Research, Pfizer Research and Development, Berlin, Germany
Andrius Bernatavicius
Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
Antonius P. A. Janssen
Leiden Institute of Chemistry, Leiden University, Leiden, The Netherlands
Mike Preuss
Universiteit Leiden
Artificial Intelligence · Games · ChemAI · Optimization · Social Media Computing
Gerard J. P. van Westen
Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
Djork-Arné Clevert
Pfizer, VP, Machine Learning Research
Drug Discovery · Machine Learning · Deep Learning · Computational Chemistry · Computational Biology