Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration

📅 2025-10-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Supervised learning for single-step retrosynthesis is severely hindered by the scarcity of labeled chemical data. Method: We propose an atom-anchored chain-of-thought (CoT) reasoning framework that lets general-purpose large language models (LLMs) be applied to retrosynthesis without labeled training data. The approach uses unique atom identifiers for position-aware molecular encoding, explicitly ties chemical knowledge to the reasoning process, and uses one-shot and few-shot prompting to identify reaction sites and transformation types; the same mapping can be used to generate theory-grounded synthetic datasets. Contribution/Results: The framework removes the reliance on annotated data and establishes an interpretable structure–knowledge–reasoning mapping. Across academic benchmarks and expert-validated drug discovery molecules, it achieves success rates of ≥90% in identifying chemically plausible reaction sites, ≥40% in named-reaction classification, and ≥74% in predicting final reactants, on a task where LLMs have previously underperformed.

📝 Abstract

Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring labeled training data. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a one-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites ($\geq 90\%$), named reaction classes ($\geq 40\%$), and final reactants ($\geq 74\%$). Beyond solving complex chemical tasks, our work also provides a method to generate theoretically grounded synthetic datasets by mapping chemical knowledge onto the molecular structure, thereby addressing data scarcity.
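
To make the idea of atom anchoring concrete, here is a minimal sketch of how atoms in a SMILES string could be given unique identifiers and exposed to an LLM in a prompt. It is an illustration under stated assumptions, not the authors' implementation: the function names `anchor_atoms` and `build_site_prompt`, the simplified regular expression, and the prompt wording are all hypothetical, and the tokenizer only covers bracket atoms and the common organic subset.

```python
import re

# Hypothetical, simplified SMILES atom pattern (an assumption for this sketch):
# bracket atoms, two-letter halogens, then one-letter organic-subset atoms
# (aromatic atoms are lowercase). Bonds, branches, and ring digits are skipped.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def anchor_atoms(smiles):
    """Return (atom_id, atom_token) pairs in left-to-right SMILES order."""
    matches = ATOM_PATTERN.finditer(smiles)
    return [(i, m.group(0)) for i, m in enumerate(matches, start=1)]


def build_site_prompt(smiles):
    """Build an illustrative prompt that exposes atom identifiers to an LLM."""
    atom_table = ", ".join(f"{i}:{tok}" for i, tok in anchor_atoms(smiles))
    return (
        f"Product SMILES: {smiles}\n"
        f"Atom identifiers: {atom_table}\n"
        "List the atom identifiers of each plausible disconnection site "
        "and name the transformation class."
    )


print(build_site_prompt("CC(=O)Oc1ccccc1"))
```

Because every atom carries a stable identifier, an answer from the model can refer to specific positions (for example, a pair of atom identifiers) rather than to a free-text description of a fragment.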

Problem

Research questions and friction points this paper is trying to address.

Addressing chemical data scarcity with molecular reasoning that requires no labeled training data

Enhancing LLM performance in retrosynthesis prediction tasks

Generating synthetic datasets by anchoring chemical knowledge to atoms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Anchors chain-of-thought reasoning to the molecular structure with unique atomic identifiers

Applies one-shot and few-shot prompting of general-purpose LLMs

Generates synthetic datasets to overcome data scarcity
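
As a rough sketch of what an atom-anchored dataset record might look like, the example below ties a chemical label to a set of atom identifiers and checks that the identifiers actually refer to atoms in the molecule. The record layout, the names `AnchoredLabel` and `is_well_anchored`, and the example values are assumptions made for illustration; the source does not specify a data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchoredLabel:
    """One hypothetical dataset record: a chemical label tied to atom identifiers."""
    smiles: str
    atom_ids: tuple
    label: str


def is_well_anchored(record, atom_count):
    """Accept a record only if its identifiers are unique and within 1..atom_count."""
    ids = record.atom_ids
    return (
        len(ids) > 0
        and len(set(ids)) == len(ids)
        and all(1 <= i <= atom_count for i in ids)
    )


# Illustrative values only: the ester C-O bond in phenyl acetate,
# with atoms numbered 1..10 in left-to-right SMILES order.
record = AnchoredLabel("CC(=O)Oc1ccccc1", (2, 4), "ester formation")
print(is_well_anchored(record, atom_count=10))
```

A check of this kind is one simple way to reject model outputs that reference atoms which do not exist before they enter a synthetic dataset.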
