AssayMatch: Learning to Select Data for Molecular Activity Models

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
In drug discovery, cross-source bioactivity data (e.g., from ChEMBL) introduce substantial noise due to heterogeneous experimental protocols, degrading molecular activity prediction performance. To address this, we propose a data selection framework tailored to unlabeled test sets with unknown ground-truth labels. First, we quantify each training assay’s contribution to model predictions via data attribution. Second, leveraging assay text descriptions, we fine-tune a language model to jointly encode semantic similarity and biological plausibility, enabling adaptive selection of relevant training assays for unseen test compounds. Evaluated across 12 model–target pairs, our method outperforms strong baselines on 9, yielding significant improvements in predictive accuracy. It effectively filters out detrimental assays, thereby enhancing model robustness and data utilization efficiency without requiring test-set labels.
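The summary's second step pairs attribution scores with embedding similarity. As a minimal illustrative sketch (not the paper's actual objective), one could fine-tune assay-description embeddings so that their cosine similarity regresses toward an attribution-derived compatibility target; the function name and loss form below are assumptions for illustration:

```python
import numpy as np

def compatibility_loss(emb_a, emb_b, target_compat):
    """Hypothetical per-pair loss: squared error between the cosine
    similarity of two assay-description embeddings and a target
    compatibility value derived from data-attribution scores."""
    cos = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return (cos - target_compat) ** 2

# Identical embeddings with target compatibility 1.0 give zero loss.
loss = compatibility_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0]), 1.0)
```

In practice such a loss would be backpropagated through the language model; here only the loss computation is shown.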

📝 Abstract
The performance of machine learning models in drug discovery is highly dependent on the quality and consistency of the underlying training data. Due to limitations in dataset sizes, many models are trained by aggregating bioactivity data from diverse sources, including public databases such as ChEMBL. However, this approach often introduces significant noise due to variability in experimental protocols. We introduce AssayMatch, a framework for data selection that builds smaller, more homogeneous training sets attuned to the test set of interest. AssayMatch leverages data attribution methods to quantify the contribution of each training assay to model performance. These attribution scores are used to fine-tune language embeddings of text-based assay descriptions to capture not just semantic similarity, but also the compatibility between assays. Unlike existing data attribution methods, our approach enables data selection for a test set with unknown labels, mirroring real-world drug discovery campaigns where the activities of candidate molecules are not known in advance. At test time, embeddings fine-tuned with AssayMatch are used to rank all available training data. We demonstrate that models trained on data selected by AssayMatch are able to surpass the performance of the model trained on the complete dataset, highlighting its ability to effectively filter out harmful or noisy experiments. We perform experiments on two common machine learning architectures and see increased prediction capability over a strong language-only baseline for 9/12 model-target pairs. AssayMatch provides a data-driven mechanism to curate higher-quality datasets, reducing noise from incompatible experiments and improving the predictive power and data efficiency of models for drug discovery. AssayMatch is available at https://github.com/Ozymandias314/AssayMatch.
Problem

Research questions and friction points this paper is trying to address.

Selecting optimal training data for molecular activity prediction models
Reducing noise from incompatible experimental assays in drug discovery
Improving model performance with data-driven assay compatibility assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework selects homogeneous training data for models
Uses attribution scores to fine-tune assay embeddings
Ranks training data using compatibility-aware embeddings
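The test-time selection step above amounts to ranking training assays by embedding similarity to the test set and keeping the top candidates. A minimal sketch, assuming embeddings are already computed (the function names and the top-k cutoff are illustrative, not the paper's implementation):

```python
import numpy as np

def rank_assays_by_similarity(test_emb, train_embs):
    """Rank training-assay embeddings by cosine similarity to a
    test-set embedding. Returns indices (best first) and scores."""
    test_unit = test_emb / np.linalg.norm(test_emb)
    train_unit = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = train_unit @ test_unit           # cosine similarities
    order = np.argsort(-sims)               # descending similarity
    return order, sims[order]

def select_top_k(order, k):
    """Keep the k most compatible training assays."""
    return order[:k]

# Toy example: assay 0 matches the test embedding exactly,
# assay 2 partially, assay 1 is orthogonal.
test_emb = np.array([1.0, 0.0])
train_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
order, scores = rank_assays_by_similarity(test_emb, train_embs)
# order → [0, 2, 1]
```

Selecting the top-k assays this way yields the smaller, more homogeneous training set the framework trains on.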
Vincent Fan
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139
Regina Barzilay
Professor of EECS, MIT