Large Language Models are In-Context Molecule Learners

📅 2024-03-07
🏛️ arXiv.org
📈 Citations: 4
Influential: 2
🤖 AI Summary
To address weak cross-modal alignment between molecules and natural language, reliance on domain-specific pretraining, and stringent demands on model scale, this paper proposes In-Context Molecule Adaptation (ICMA), a context-driven adaptation paradigm that requires no additional pretraining. ICMA operates in three stages: (1) Hybrid Context Retrieval, combining BM25-based caption search with molecule graph retrieval; (2) Post-retrieval Re-ranking via Sequence Reversal and Random Walk selection; and (3) In-Context Molecule Tuning, which fine-tunes the LLM on prompts carrying the retrieved examples, so molecular knowledge enters directly through the model's context. The work offers the first empirical evidence that LLMs inherently possess in-context molecule learning capability: on molecule captioning, ICMA reaches state-of-the-art or competitive performance without extra training corpora or intricate architectures, substantially improving cross-modal alignment.
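
The tuning stage is easiest to picture as prompt assembly. Below is a minimal sketch in Python; the template and the `build_icmt_prompt` helper are assumptions for illustration (the paper's exact prompt format is not reproduced here). The LLM is then fine-tuned on such prompts with the query molecule's ground-truth caption as the target.

```python
# A minimal sketch of prompt assembly for In-Context Molecule Tuning.
# The template below is an assumption for illustration, not the paper's
# exact format.

def build_icmt_prompt(context_examples, query_smiles):
    """Pack retrieved (SMILES, caption) pairs into the LLM context.

    context_examples is assumed to be ordered so the most similar
    example sits last, i.e. closest to the query (see Sequence Reversal).
    """
    parts = [
        f"Molecule: {smiles}\nCaption: {caption}\n"
        for smiles, caption in context_examples
    ]
    parts.append(f"Molecule: {query_smiles}\nCaption:")
    return "\n".join(parts)

# Hypothetical usage with one retrieved neighbor.
print(build_icmt_prompt([("CCO", "Ethanol is a primary alcohol.")], "CC(=O)O"))
```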

📝 Abstract
Large Language Models (LLMs) have demonstrated exceptional performance in biochemical tasks, especially the molecule caption translation task, which aims to bridge the gap between molecules and natural language texts. However, previous methods for adapting LLMs to the molecule-caption translation task required extra domain-specific pre-training stages, suffered from weak alignment between molecular and textual spaces, or imposed stringent demands on the scale of LLMs. To resolve these challenges, we propose In-Context Molecule Adaptation (ICMA), a new paradigm that allows LLMs to learn the molecule-text alignment from context examples via In-Context Molecule Tuning. Specifically, ICMA incorporates three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-Context Molecule Tuning. First, Hybrid Context Retrieval uses BM25 Caption Retrieval and Molecule Graph Retrieval to retrieve similar, informative context examples. Next, Post-retrieval Re-ranking, composed of Sequence Reversal and Random Walk selection, further improves the quality of the retrieval results. Finally, In-Context Molecule Tuning unlocks the in-context learning and reasoning capability of LLMs with the retrieved examples and adapts the parameters of LLMs for better alignment between molecules and texts. Experimental results demonstrate that ICMA empowers LLMs to achieve state-of-the-art or comparable performance without extra training corpora or intricate structures, showing that LLMs are inherently in-context molecule learners.
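
As a rough illustration of the Hybrid Context Retrieval stage, here is a minimal sketch that scores candidates with BM25 over captions (via the rank_bm25 package) and, as a lightweight stand-in for the paper's molecule graph retriever, Tanimoto similarity over RDKit Morgan fingerprints. The score blending, the `alpha` weight, and the fingerprint proxy are assumptions, not the paper's method.

```python
# A minimal sketch of Hybrid Context Retrieval, assuming rank_bm25 for
# BM25 Caption Retrieval and RDKit Morgan fingerprints as a lightweight
# proxy for the paper's molecule graph retriever.
from rank_bm25 import BM25Okapi
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Toy retrieval pool of (SMILES, caption) training pairs.
corpus_smiles = ["CCO", "CC(=O)O", "c1ccccc1"]
corpus_captions = [
    "Ethanol is a primary alcohol.",
    "Acetic acid is a simple carboxylic acid.",
    "Benzene is an aromatic hydrocarbon.",
]

bm25 = BM25Okapi([c.lower().split() for c in corpus_captions])
fingerprints = [
    AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
    for s in corpus_smiles
]

def hybrid_retrieve(query_smiles, query_text, k=2, alpha=0.5):
    """Blend caption and structure similarity; alpha is an assumed weight."""
    text_scores = bm25.get_scores(query_text.lower().split())
    max_t = max(text_scores) or 1.0  # normalize BM25 scores into [0, 1]
    query_fp = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(query_smiles), 2, nBits=2048)
    mol_scores = DataStructs.BulkTanimotoSimilarity(query_fp, fingerprints)
    blended = [alpha * (t / max_t) + (1 - alpha) * m
               for t, m in zip(text_scores, mol_scores)]
    ranked = sorted(range(len(blended)), key=lambda i: -blended[i])
    return [(corpus_smiles[i], corpus_captions[i]) for i in ranked[:k]]

print(hybrid_retrieve("CCCO", "a primary alcohol"))
```
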
Problem

Research questions and friction points this paper is trying to address.

LLMs require domain-specific pre-training for molecule-caption tasks
Weak alignment exists between molecular and textual spaces
Current methods demand large-scale LLMs for effective performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Molecule Adaptation (ICMA) paradigm
Hybrid Context Retrieval with BM25 caption search and molecule graph retrieval
Post-retrieval Re-ranking and In-Context Molecule Tuning (a re-ranking sketch follows this list)
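
A hedged sketch of the re-ranking step, under two stated assumptions: Random Walk selection is modeled as a probabilistic hop down the ranked candidate list, and Sequence Reversal reorders the survivors so the most similar example lands adjacent to the query in the final prompt. The paper's exact procedures may differ.

```python
# A minimal sketch of Post-retrieval Re-ranking under the stated
# assumptions; p_stay and the top-up rule are illustrative choices.
import random

def random_walk_select(ranked_candidates, k, p_stay=0.7, seed=0):
    """Walk down the ranked list, keeping each item with probability p_stay."""
    rng = random.Random(seed)
    kept = []
    for cand in ranked_candidates:
        if len(kept) == k:
            break
        if rng.random() < p_stay:
            kept.append(cand)
    # Top up deterministically if the walk kept too few candidates.
    for cand in ranked_candidates:
        if len(kept) == k:
            break
        if cand not in kept:
            kept.append(cand)
    return kept

def sequence_reversal(selected):
    """Best-first input becomes best-last, i.e. nearest to the query."""
    return list(reversed(selected))

candidates = ["ex_best", "ex_2", "ex_3", "ex_4"]  # ranked best-first
context = sequence_reversal(random_walk_select(candidates, k=2))
print(context)  # the most similar example ends up next to the query
```
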
Jiatong Li
PhD candidate, Hong Kong Polytechnic University
Natural Language Processing · Bioinformatics · Molecule Discovery
Wei Liu
Shanghai Jiao Tong University
Zhihao Ding
The Hong Kong Polytechnic University
Wenqi Fan
The Hong Kong Polytechnic University
Yuqiang Li
Central South University
Internal Combustion Engine · Combustion · Emissions · Mechanism
Qing Li
The Hong Kong Polytechnic University