🤖 AI Summary
Problem: In computational drug discovery, ligand generation faces a critical bottleneck: ensuring chemical validity and structural compatibility with protein binding pockets at the same time.
Method: This paper proposes SiDGen, a lightweight protein-conditioned diffusion generative model. It combines a dual-path conditioning mechanism with a coarse-stride folding-derived structure representation, incorporating protein embedding pooling, local pairwise bias injection, masked SMILES sequence generation, and ring-aware chemical validity checking. In-loop validity checks with an invalidity penalty stabilize training, coarse-stride folding with nearest-neighbor upsampling enables training on realistic sequence lengths, and selective compilation, dataloader tuning, and gradient accumulation restore large-scale training efficiency.
Results: In automated benchmarks, the model generates ligands with high validity, novelty, and uniqueness while achieving competitive docking performance and physicochemically plausible properties. The authors present it as a practical, scalable route to structure-guided de novo drug design.
📝 Abstract
Designing ligands that are both chemically valid and structurally compatible with protein binding pockets is a key bottleneck in computational drug discovery. Existing approaches either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability. We present SiDGen (Structure-informed Diffusion Generator), a protein-conditioned diffusion framework that integrates masked SMILES generation with lightweight folding-derived features for pocket awareness. To balance expressivity with efficiency, SiDGen supports two conditioning pathways: a streamlined mode that pools coarse structural signals from protein embeddings and a full mode that injects localized pairwise biases for stronger coupling. A coarse-stride folding mechanism with nearest-neighbor upsampling alleviates the quadratic memory costs of pair tensors, enabling training on realistic sequence lengths. Learning stability is maintained through in-loop chemical validity checks and an invalidity penalty, while large-scale training efficiency is restored via selective compilation, dataloader tuning, and gradient accumulation. In automated benchmarks, SiDGen generates ligands with high validity, uniqueness, and novelty, while achieving competitive performance in docking-based evaluations and maintaining reasonable molecular properties. These results demonstrate that SiDGen can deliver scalable, pocket-aware molecular design, providing a practical route to conditional generation for high-throughput drug discovery.
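The memory argument behind coarse-stride folding with nearest-neighbor upsampling can be illustrated with a toy sketch. This is not SiDGen's implementation: the function names `pair_entries` and `upsample_nearest`, the stride value, and the use of plain nested lists are all assumptions made for illustration. The idea is that a pair map computed on every `stride`-th residue has roughly `(L / stride)^2` entries instead of `L^2`, and each full-resolution position `(i, j)` then copies the coarse entry at `(i // stride, j // stride)`.

```python
import math


def pair_entries(seq_len, stride=1):
    """Number of entries in a square pair map computed on a strided grid."""
    n = math.ceil(seq_len / stride)
    return n * n


def upsample_nearest(coarse, seq_len, stride):
    """Nearest-neighbor upsampling of a coarse pair map (hypothetical helper).

    Full-resolution entry (i, j) copies coarse entry (i // stride, j // stride),
    so each coarse value is replicated over a stride x stride block.
    """
    return [
        [coarse[i // stride][j // stride] for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy example: 8 residues, stride 4, so the coarse pair map is 2 x 2.
seq_len, stride = 8, 4
coarse = [[0.0, 1.0],
          [1.0, 0.0]]
full = upsample_nearest(coarse, seq_len, stride)

# Memory comparison for a longer (assumed) sequence length of 1024 residues.
dense = pair_entries(1024)             # full-resolution pair map
strided = pair_entries(1024, stride)   # coarse-stride pair map
print(f"dense: {dense} entries, strided: {strided} entries")
```

With a stride of 4, the coarse map holds 16 times fewer entries than the dense one, which is the quadratic saving the abstract refers to. The sketch materializes the upsampled map for clarity; whether the actual model does so or reads coarse values on demand is not stated in the abstract.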