ChemMLLM: Chemical Multimodal Large Language Model

📅 2025-05-22
🤖 AI Summary
Existing chemical multimodal large language models (MLLMs) lack unified cross-modal understanding and generation capabilities across text, SMILES strings, and chemical images. To address this, we propose the first unified text–symbol–image trimodal MLLM for chemistry, featuring explicit alignment across all three modalities. We design five novel chemical multimodal tasks and introduce the first domain-specific benchmark and dataset for comprehensive evaluation. Our model extends a large language model architecture by integrating a Vision Transformer (ViT) image encoder and a dedicated SMILES encoder, jointly optimized via instruction tuning and cross-modal alignment losses. It achieves state-of-the-art performance on all five tasks; notably, in molecular image optimization (Property Improvement), it scores 4.27, which is 118.9% higher than GPT-4o (1.95). The code is publicly available.

📝 Abstract
Multimodal large language models (MLLMs) have made impressive progress in many applications in recent years. However, chemical MLLMs that can handle cross-modal understanding and generation remain underexplored. To fill this gap, we propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation. We also design five multimodal tasks spanning text, molecular SMILES strings, and images, and curate the corresponding datasets. We benchmark ChemMLLM against a range of leading general-purpose MLLMs and chemical LLMs on these tasks. Experimental results show that ChemMLLM achieves superior performance across all evaluated tasks. For example, in the molecule image optimization task, ChemMLLM outperforms the best baseline (GPT-4o) by 118.9% (4.27 vs. 1.95 property improvement). The code is publicly available at https://github.com/bbsbz/ChemMLLM.git.
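
As a quick sanity check on the headline number, the relative gain can be recomputed from the two scores quoted in the abstract. This is only an illustrative sketch: the helper name `relative_improvement` is hypothetical and not taken from the ChemMLLM code base, and it assumes the percentage is defined as (model - baseline) / baseline. With the rounded scores 4.27 and 1.95 the result is roughly 119%, which agrees with the reported 118.9% up to rounding of the inputs.

```python
def relative_improvement(model_score: float, baseline_score: float) -> float:
    """Relative improvement of model_score over baseline_score, in percent.

    Hypothetical helper for illustration; assumes higher scores are better
    and that the gain is measured relative to the baseline.
    """
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (model_score - baseline_score) / abs(baseline_score) * 100.0


# Property-improvement scores as quoted in the abstract (rounded values).
chemmllm_score = 4.27
gpt4o_score = 1.95

gain = relative_improvement(chemmllm_score, gpt4o_score)
print(f"ChemMLLM vs. GPT-4o: {gain:.1f}% relative improvement")
```

The small gap between this value and the quoted 118.9% presumably comes from the paper using unrounded scores; that is an assumption, not something the abstract states.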

Problem

Research questions and friction points this paper is trying to address.

Developing a chemical multimodal model for cross-modal tasks
Creating datasets for text, SMILES, and image tasks
Benchmarking performance against general and chemical LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified chemical multimodal model for molecules
Five multimodal tasks across text, SMILES, images
Superior performance in molecule image optimization