SiDGen: Structure-informed Diffusion for Generative modeling of Ligands for Proteins

📅 2025-11-12

🤖 AI Summary
Problem: In computational drug discovery, ligand generation is bottlenecked by the need to ensure chemical validity and structural compatibility with protein binding pockets at the same time. Method: This paper proposes SiDGen, a lightweight protein-conditioned diffusion generative model. It combines a dual-path conditioning mechanism with a coarse-stride, folding-derived structure representation: one path pools protein embeddings, the other injects local geometric biases. Ligands are produced by masked SMILES sequence generation with ring-aware chemical validity checking, which helps keep training stable. Selective compilation and gradient accumulation are used to keep large-scale training efficient on long sequences. Results: The authors report improved generation efficiency and scalability. In automated benchmarks the model maintains high molecular validity, novelty, and uniqueness while achieving competitive docking performance and physicochemically plausible properties, offering an efficient route to large-scale, structure-guided de novo drug design.
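
The summary does not specify how the ring-aware validity check or the abstract's invalidity penalty is implemented. As a minimal sketch, assuming a purely syntactic check (ring-closure labels are paired and branches are balanced) and a penalty proportional to the fraction of invalid samples, the idea could look like the following. The function names and the `weight` parameter are hypothetical and do not come from the paper; a real pipeline would more likely rely on a cheminformatics toolkit for full validity checking.

```python
DIGITS = "0123456789"

def passes_ring_and_branch_check(smiles):
    """Syntactic check only: ring-closure labels pair up and parentheses balance.

    Hypothetical stand-in for a ring-aware validity check; it does not
    verify valence, aromaticity, or atom symbols.
    """
    open_rings = set()   # ring labels that have been opened but not closed
    depth = 0            # current branch nesting depth
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms so isotopes and charges are not read as ring labels.
            close = smiles.find("]", i)
            if close == -1:
                return False
            i = close + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # Two-digit ring label, e.g. %12.
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or any(c not in DIGITS for c in label):
                return False
            open_rings ^= {int(label)}
            i += 3
            continue
        elif ch in DIGITS:
            # Single-digit ring label: first occurrence opens, second closes.
            open_rings ^= {int(ch)}
        i += 1
    return depth == 0 and not open_rings

def invalidity_penalty(samples, weight):
    """Assumed penalty form: weight times the fraction of invalid samples."""
    if not samples:
        return 0.0
    invalid = sum(1 for s in samples if not passes_ring_and_branch_check(s))
    return weight * invalid / len(samples)
```

In a training loop, a term like `invalidity_penalty(decoded_batch, weight)` would be added to the main loss, where `decoded_batch` is an assumed list of SMILES strings sampled during training.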

📝 Abstract
Designing ligands that are both chemically valid and structurally compatible with protein binding pockets is a key bottleneck in computational drug discovery. Existing approaches either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability. We present SiDGen (Structure-informed Diffusion Generator), a protein-conditioned diffusion framework that integrates masked SMILES generation with lightweight folding-derived features for pocket awareness. To balance expressivity with efficiency, SiDGen supports two conditioning pathways: a streamlined mode that pools coarse structural signals from protein embeddings and a full mode that injects localized pairwise biases for stronger coupling. A coarse-stride folding mechanism with nearest-neighbor upsampling alleviates the quadratic memory costs of pair tensors, enabling training on realistic sequence lengths. Learning stability is maintained through in-loop chemical validity checks and an invalidity penalty, while large-scale training efficiency is restored via selective compilation, dataloader tuning, and gradient accumulation. In automated benchmarks, SiDGen generates ligands with high validity, uniqueness, and novelty, while achieving competitive performance in docking-based evaluations and maintaining reasonable molecular properties. These results demonstrate that SiDGen can deliver scalable, pocket-aware molecular design, providing a practical route to conditional generation for high-throughput drug discovery.
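
To make the masked SMILES generation idea concrete, here is a minimal sketch of a masking-based corruption step and an iterative unmasking loop. It assumes character-level tokens, a single `[MASK]` token, and a caller-supplied `predict` function standing in for the trained, protein-conditioned denoiser; none of these names or choices are taken from the paper, and the actual noise schedule and tokenizer may differ.

```python
import random

MASK = "[MASK]"  # hypothetical mask token

def mask_tokens(tokens, mask_rate, rng):
    """Forward (corruption) step: mask each token independently with probability mask_rate."""
    return [MASK if rng.random() < mask_rate else tok for tok in tokens]

def iterative_unmask(tokens, predict, steps):
    """Reverse process: fill masked positions over a fixed number of steps.

    `predict(tokens, index)` is an assumed interface that returns the token
    to place at `index` given the current partially masked sequence.
    """
    tokens = list(tokens)
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        remaining_steps = steps - step
        # Reveal an even share of the remaining masked positions (ceiling division).
        n_reveal = -(-len(masked) // remaining_steps)
        for i in masked[:n_reveal]:
            tokens[i] = predict(tokens, i)
    return tokens

# Toy usage: an "oracle" predictor that simply restores the original token.
original = list("c1ccccc1")
noisy = mask_tokens(original, 1.0, random.Random(0))
restored = iterative_unmask(noisy, lambda toks, i: original[i], steps=3)
```

In a real model the oracle would be replaced by sampling from the denoiser's predicted token distribution, conditioned on the protein features.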
Problem

Research questions and friction points this paper is trying to address.

Generating chemically valid ligands that are structurally compatible with protein pockets
Overcoming the throughput and scalability limits of memory-intensive structural encoding in existing ligand generation methods
Balancing structural awareness with computational efficiency in molecular design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Protein-conditioned diffusion framework with masked SMILES generation
Two conditioning pathways balancing expressivity and efficiency
Coarse-stride folding mechanism reducing quadratic memory costs
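
The memory saving from the coarse-stride idea can be illustrated with a small sketch. It assumes the pair tensor is computed only at every `stride`-th residue and then expanded back to full length by block replication (nearest-neighbor upsampling); the function names are hypothetical and the paper's actual folding module is not reproduced here.

```python
def coarse_length(seq_len, stride):
    """Number of coarse positions when keeping every `stride`-th residue (ceiling division)."""
    return -(-seq_len // stride)

def pair_entry_ratio(seq_len, stride):
    """Ratio of full pair entries (L x L) to coarse pair entries (L_c x L_c)."""
    n_coarse = coarse_length(seq_len, stride)
    return (seq_len * seq_len) / (n_coarse * n_coarse)

def upsample_pair_nearest(coarse, seq_len, stride):
    """Expand a coarse pair matrix to seq_len x seq_len.

    Each full-resolution entry (i, j) copies the coarse entry
    (i // stride, j // stride), i.e. nearest-neighbor upsampling.
    """
    return [
        [coarse[i // stride][j // stride] for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Toy usage: a 2 x 2 coarse pair matrix expanded to 8 x 8 with stride 4.
coarse = [[1, 2], [3, 4]]
full = upsample_pair_nearest(coarse, seq_len=8, stride=4)
```

Under these assumptions the expensive pair computation touches roughly `stride ** 2` times fewer entries than a full-resolution pair tensor, which is where the quadratic memory relief comes from; the trade-off is that residues within the same block share identical pair features.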