🤖 AI Summary
This work proposes MolDA, a multimodal framework that addresses two weaknesses of autoregressive molecular generation models: their unidirectional generation bias and their tendency to violate global chemical constraints, which yields structurally invalid molecules. MolDA replaces the autoregressive backbone with a discrete large language diffusion model, bringing masked diffusion into multimodal molecular modeling. A hybrid graph neural network encoder, combined with a Q-Former module, aligns structural and linguistic representations. Bidirectional iterative denoising then refines molecular structures in accordance with learned chemical preferences. By avoiding the strictly left-to-right locality of autoregressive generation, the approach is reported to improve chemical validity and structural plausibility across molecular generation, captioning, and property prediction, enabling more robust multimodal molecular reasoning.
📝 Abstract
Large Language Models (LLMs) have significantly advanced molecular discovery, but existing multimodal molecular architectures fundamentally rely on autoregressive (AR) backbones. This strict left-to-right inductive bias is sub-optimal for generating chemically valid molecules: it struggles to account for non-local, global constraints (e.g., ring closures) and often accumulates structural errors during sequential generation. To address these limitations, we propose MolDA (Molecular language model with masked Diffusion with mAsking), a novel multimodal framework that replaces the conventional AR backbone with a discrete Large Language Diffusion Model. MolDA extracts comprehensive structural representations using a hybrid graph encoder, which captures both local and global topologies, and projects them into the language token space via a Q-Former. Furthermore, we mathematically reformulate Molecular Structure Preference Optimization for masked diffusion. Through bidirectional iterative denoising, MolDA promotes global structural coherence, chemical validity, and robust reasoning across molecule generation, captioning, and property prediction.
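To make the idea of bidirectional iterative denoising concrete, the following is a minimal toy sketch and not MolDA's implementation. It assumes a confidence-based unmasking schedule, which the abstract does not specify. The names `toy_denoiser` and `iterative_unmask` are hypothetical, and the "denoiser" simply reads from a known reference string instead of running a trained network. The point is only to show that each step sees the whole sequence, so a position can be filled using context on both its left and its right.

```python
MASK = "<mask>"


def toy_denoiser(tokens, reference):
    """Hypothetical stand-in for a trained denoising network.

    For each masked position, return (predicted_token, confidence).
    Confidence counts already-revealed neighbours on BOTH sides, which
    mimics bidirectional context. A real model would output token
    probabilities instead of copying from a reference.
    """
    preds = {}
    n = len(tokens)
    for i, tok in enumerate(tokens):
        if tok != MASK:
            continue
        left = i > 0 and tokens[i - 1] != MASK
        right = i < n - 1 and tokens[i + 1] != MASK
        preds[i] = (reference[i], 1 + int(left) + int(right))
    return preds


def iterative_unmask(reference, num_steps):
    """Start fully masked and reveal the most confident tokens each step."""
    tokens = [MASK] * len(reference)
    per_step = -(-len(tokens) // num_steps)  # ceiling division
    history = []
    for _ in range(num_steps):
        preds = toy_denoiser(tokens, reference)
        if not preds:
            break  # nothing left to unmask
        # Commit the highest-confidence predictions; ties go to the left.
        chosen = sorted(preds, key=lambda i: (-preds[i][1], i))[:per_step]
        for i in chosen:
            tokens[i] = preds[i][0]
        history.append(list(tokens))
    return tokens, history


# Character-level tokens of the benzene SMILES string (assumed tokenisation).
smiles_tokens = list("c1ccccc1")
final, history = iterative_unmask(smiles_tokens, num_steps=4)
for step, state in enumerate(history, start=1):
    print(step, " ".join(state))
print("".join(final))
```

Unlike an autoregressive decoder, nothing here forces position order: the two ring-closure digits in `c1ccccc1` are both visible to the denoiser at every step, so a trained model could keep them consistent with each other before committing either one.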