Scaffold-Aware Generative Augmentation and Reranking for Enhanced Virtual Screening

📅 2025-10-17
🤖 AI Summary
Virtual screening faces three key challenges: severe class imbalance (scarcity of active compounds), structural imbalance (overrepresentation of privileged scaffolds), and insufficient candidate molecular diversity. To address these, we propose ScaffAug, a scaffold-aware framework that integrates generative augmentation and diversity-aware reranking. It employs a graph diffusion model conditioned on molecular scaffolds to generate scaffold-aligned molecules; introduces a scaffold-aware sampling algorithm that favors underrepresented scaffolds during generation; incorporates a model-agnostic self-training module to mitigate label sparsity; and applies a diversity-oriented reranking strategy to optimize hit lists. Evaluated across five target classes, ScaffAug consistently outperforms state-of-the-art methods, achieving significant improvements in active compound recall (+12.7%) and scaffold coverage (+34.5%). Ablation studies validate the individual contributions of each component.

📝 Abstract
Ligand-based virtual screening (VS) is an essential step in drug discovery that evaluates large chemical libraries to identify compounds that potentially bind to a therapeutic target. However, VS faces three major challenges: class imbalance due to the low active rate, structural imbalance among active molecules where certain scaffolds dominate, and the need to identify structurally diverse active compounds for novel drug development. We introduce ScaffAug, a scaffold-aware VS framework that addresses these challenges through three modules. The augmentation module first generates synthetic data conditioned on scaffolds of actual hits using generative AI, specifically a graph diffusion model. This mitigates the class imbalance and, through our proposed scaffold-aware sampling algorithm, the structural imbalance as well: the algorithm is designed to produce more samples for active molecules with underrepresented scaffolds. A model-agnostic self-training module is then used to safely integrate the synthetic data generated by the augmentation module with the original labeled data. Lastly, we introduce a reranking module that improves VS by enhancing scaffold diversity in the top recommended set of molecules, while maintaining, and even improving, overall performance in identifying novel active compounds. We conduct comprehensive computational experiments across five target classes, comparing ScaffAug against existing baseline methods on multiple evaluation metrics and performing ablation studies on ScaffAug. Overall, this work introduces novel perspectives on effectively enhancing VS by leveraging generative augmentation, reranking, and general scaffold-awareness.
Problem

Research questions and friction points this paper is trying to address.

Addresses class imbalance in virtual screening through generative augmentation
Mitigates structural imbalance by scaffold-aware sampling of active molecules
Enhances scaffold diversity while maintaining identification of active compounds
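
The idea of producing more samples for actives with underrepresented scaffolds can be illustrated with a minimal sketch. This is not the paper's sampling algorithm, whose details are not given here; it is plain inverse-frequency weighting. The function name `scaffold_sampling_weights`, the `power` parameter, and the example scaffold strings are hypothetical, and scaffolds are assumed to be precomputed as strings (for example, scaffold SMILES).

```python
from collections import Counter


def scaffold_sampling_weights(scaffolds, power=1.0):
    """Return one sampling probability per active molecule.

    Hypothetical illustration: each molecule is weighted by the inverse
    frequency of its scaffold, so molecules with rare scaffolds are more
    likely to be chosen as conditioning inputs for generation.
    """
    counts = Counter(scaffolds)
    # power=1.0 gives every scaffold group the same total probability;
    # power=0.0 falls back to uniform sampling over molecules.
    raw = [1.0 / (counts[s] ** power) for s in scaffolds]
    total = sum(raw)
    return [w / total for w in raw]


# Assumed toy data: three actives share one scaffold, one active has another.
active_scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccccc1", "c1ccncc1"]
weights = scaffold_sampling_weights(active_scaffolds)
uniform = scaffold_sampling_weights(active_scaffolds, power=0.0)
```

With `power=1.0`, the single molecule carrying the rare scaffold receives as much total probability as the three molecules sharing the common scaffold combined. These weights could then be passed to a sampler such as `random.choices(molecules, weights=weights, k=n)` to pick which actives seed the generator.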
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates synthetic data using graph diffusion model
Employs scaffold-aware sampling to address structural imbalance
Reranks molecules to enhance scaffold diversity in recommendations
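
Reranking for scaffold diversity can be sketched in a few lines. This is not the paper's reranking module; it is a simple greedy heuristic that discounts a candidate's score each time its scaffold has already been selected. The function name `rerank_by_scaffold`, the `penalty` parameter, the candidate tuples, and the assumption of positive scores are all hypothetical.

```python
def rerank_by_scaffold(candidates, penalty=0.5):
    """Greedily reorder (mol_id, score, scaffold) tuples for scaffold diversity.

    Hypothetical illustration: at each step the candidate with the highest
    adjusted score is picked, where the adjusted score is
    score * penalty ** (number of already-picked molecules with that scaffold).
    Scores are assumed to be positive.
    """
    remaining = list(candidates)
    picked_per_scaffold = {}
    ranked = []
    while remaining:
        best = max(
            remaining,
            key=lambda c: c[1] * penalty ** picked_per_scaffold.get(c[2], 0),
        )
        remaining.remove(best)
        picked_per_scaffold[best[2]] = picked_per_scaffold.get(best[2], 0) + 1
        ranked.append(best)
    return ranked


# Assumed toy hit list: the two top-scoring molecules share scaffold "A".
hits = [("m1", 0.95, "A"), ("m2", 0.90, "A"), ("m3", 0.80, "B"), ("m4", 0.60, "C")]
diverse_order = [c[0] for c in rerank_by_scaffold(hits)]
score_order = [c[0] for c in rerank_by_scaffold(hits, penalty=1.0)]
```

In this toy example the second molecule with scaffold "A" is pushed below the molecules with scaffolds "B" and "C", while `penalty=1.0` reproduces the plain score ranking. A smaller `penalty` trades predicted activity for diversity more aggressively.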