From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing molecular language models struggle to capture molecular graph structure and lack explicit mechanisms for target-aware generation. To address these limitations, this work proposes SoftMol, a framework that introduces soft fragments, a rule-free block representation of SMILES, together with a block-diffusion language model (SoftBD) that combines local bidirectional diffusion with autoregressive generation. SoftMol also incorporates a gated Monte Carlo tree search to enable goal-directed molecular design. With SoftBD trained on the ZINC-Curated dataset, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, increases molecular diversity by 2-3x, and speeds up inference by 6.6x compared with current state-of-the-art models.

📝 Abstract
Drug discovery can be viewed as a combinatorial search over an immense chemical space, motivating the development of deep generative models for de novo molecular design. Among these, GPT-based molecular language models (MLMs) have shown strong molecular design performance by learning chemical syntax and semantics from large-scale data. However, existing MLMs face two fundamental limitations: they inadequately capture the graph-structured nature of molecules when formulated as next-token prediction problems, and they typically lack explicit mechanisms for target-aware generation. Here, we propose SoftMol, a unified framework that co-designs molecular representation, model architecture, and search strategy for target-aware molecular generation. SoftMol introduces soft fragments, a rule-free block representation of SMILES that enables diffusion-native modeling, and develops SoftBD, the first block-diffusion molecular language model that combines local bidirectional diffusion with autoregressive generation under molecular structural constraints. To favor generated molecules with high drug-likeness and synthetic accessibility, SoftBD is trained on a carefully curated dataset named ZINC-Curated. SoftMol further integrates a gated Monte Carlo tree search to assemble fragments in a target-aware manner. Experimental results show that, compared with current state-of-the-art models, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2-3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency. Code is available at https://github.com/szu-aicourse/softmol
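
The combination of local bidirectional diffusion with autoregressive generation described in the abstract can be illustrated with a toy sketch. This is not SoftBD's actual algorithm: the mask symbol, the vocabulary, the `toy_denoiser` stand-in, and the unmasking schedule are all assumptions made for illustration. It only shows the control flow, where blocks are produced left to right and the tokens inside each block are filled in over a few parallel unmasking steps.

```python
import random

MASK = "?"  # hypothetical mask symbol; "?" is not a SMILES character
VOCAB = ["C", "N", "O", "c", "1", "(", ")", "="]  # toy vocabulary, not the paper's


def toy_denoiser(prefix, block, rng):
    """Stand-in for a trained model: proposes a token for every masked slot.

    A real block-diffusion model would condition on `prefix` (earlier blocks)
    and on the already unmasked positions of `block`; this toy ignores both.
    """
    return {i: rng.choice(VOCAB) for i, tok in enumerate(block) if tok == MASK}


def generate(num_blocks=3, block_size=4, steps=2, seed=0):
    rng = random.Random(seed)
    prefix = []  # completed blocks, produced left to right (autoregressive part)
    for _ in range(num_blocks):
        block = [MASK] * block_size  # each block starts fully masked
        for step in range(steps):
            proposals = toy_denoiser(prefix, block, rng)
            # Unmask a share of the remaining slots; the last step fills the rest.
            remaining_steps = steps - step
            k = -(-len(proposals) // remaining_steps)  # ceiling division
            for i in rng.sample(sorted(proposals), k):
                block[i] = proposals[i]
        prefix.extend(block)
    return prefix


tokens = generate()
print("".join(tokens))
```

Because the denoiser here samples at random, the printed string is not a valid molecule. In a trained model, the proposals would depend on the prefix and on the partially unmasked block, which is where the bidirectional context within a block comes from.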
Problem

Research questions and friction points this paper is trying to address.

molecular generation
graph-structured molecules
target-aware generation
molecular language models
chemical space
Innovation

Methods, ideas, or system contributions that make the work stand out.

block diffusion
soft fragments
target-aware generation
molecular language model
Monte Carlo tree search
Qianwei Yang
School of Artificial Intelligence, Shenzhen University, Shenzhen 518060, China
Dong Xu
Shenzhen University
Artificial intelligence, Drug Design
Zhangfan Yang
School of Computer Science, University of Nottingham Ningbo, Ningbo 315100, China
Sisi Yuan
School of Artificial Intelligence, Shenzhen University, Shenzhen 518060, China
Zexuan Zhu
Shenzhen University
Evolutionary Computation, Memetic Computing, Bioinformatics, Machine Learning
Jianqiang Li
Shenzhen University
CPS, Robotics, Internet of Things
Junkai Ji
School of Artificial Intelligence, Shenzhen University, Shenzhen 518060, China