The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of predicting cellular transcriptomic responses to unseen compounds under a chemically structured train/test split, where complex models often fail to surpass simple baselines. The authors propose a staged approach: first benchmarking simple baselines, then introducing non-parametric retrieval weighted by Tanimoto similarity, and finally fusing a frozen chemistry embedding with retrieval-support features to predict the residual over the mean response, augmented by an uncertainty head and gene programs. Crucially, the study demonstrates on real drug-response data that the choice of evaluation metric can invert model rankings, underscoring the importance of metric calibration. On the VCPI THP-1 dataset, the fusion decoder significantly outperforms a regularized linear fingerprint baseline under the contest's wMSE metric (ΔwMSE = −0.012, paired bootstrap p < 10⁻⁴), whereas an inverse-variance per-gene proxy metric ranks that linear baseline first even though it is the worst chemistry-aware predictor under the contest metric.
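The retrieval stage mentioned above can be illustrated with a minimal sketch. It assumes binary fingerprints stored as NumPy arrays and a top-k neighbour cutoff; the function names, the choice of `k`, and the fallback to the mean training response are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def tanimoto(query, bank):
    """Tanimoto similarity between one binary fingerprint and each row of `bank`."""
    query = np.asarray(query).astype(bool)
    bank = np.asarray(bank).astype(bool)
    inter = (bank & query).sum(axis=1)
    union = (bank | query).sum(axis=1)
    # Assumption: similarity is defined as 0 when both fingerprints are empty.
    return np.divide(inter, union, out=np.zeros(len(bank)), where=union > 0)

def retrieve_response(query_fp, train_fps, train_responses, k=2):
    """Tanimoto-weighted average of the k most similar training responses.

    `k` and the mean-response fallback are illustrative choices only.
    """
    train_responses = np.asarray(train_responses, dtype=float)
    sims = tanimoto(query_fp, train_fps)
    top = np.argsort(-sims, kind="stable")[:k]  # indices of the k nearest compounds
    weights = sims[top]
    if weights.sum() == 0:
        # No chemical overlap at all: fall back to the mean training response.
        return train_responses.mean(axis=0)
    return (weights[:, None] * train_responses[top]).sum(axis=0) / weights.sum()

# Toy example: 3 training compounds, 4-bit fingerprints, 2 genes.
train_fps = np.array([[1, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 0, 1, 1]])
train_responses = np.array([[1.0, 0.0],
                            [0.0, 4.0],
                            [9.0, 9.0]])
prediction = retrieve_response(np.array([1, 1, 0, 0]), train_fps, train_responses)
```

In practice the fingerprints would come from a cheminformatics toolkit rather than hand-written bit vectors; the toy arrays here only show how similarity turns into averaging weights.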

📝 Abstract

Predicting how a cell's transcriptome responds to a drug it has never seen is a core, hard problem in computational cell biology: recent benchmarks show complex models often fail to beat trivial baselines once test compounds are held out by chemistry. We study one cell line and assay, THP-1 cells profiled by DRUG-seq, scored by the active-compound weighted MSE (wMSE) of the VCPI prediction contest. We propose a staged approach: dumb baselines (untreated control and mean training-compound response) that the field keeps failing to beat; non-parametric retrieval (a Tanimoto-weighted average of a held-out compound's nearest training compounds); and a fusion stage combining a frozen chemistry embedding with retrieval-support features to predict the residual over the mean, with an uncertainty head and gene programs. On the released VCPI THP-1 DRUG-seq data (14,026 training compounds), under a Bemis-Murcko scaffold split, the model ranking inverts depending on the metric. Under an inverse-variance per-gene proxy, a regularized linear regression on Morgan fingerprints appears to win over the deep models, retrieval, and ChemBERTa -- the textbook "simple baselines win" result. But under the contest's true active-set metric (per-(gene, compound) Mejia weights, validated against the official scorer; mean baseline 0.535 vs the organizers' 0.507 reference), the ranking reverses: the deep models win, our fusion decoder significantly beats the linear fingerprint baseline (-0.012 wMSE, paired bootstrap p < 10^-4), and the proxy's winner becomes the worst chemistry-aware predictor. Picking the metric picks the winner -- to our knowledge the first demonstration on real held-out drug chemistry of the metric-calibration effect established largely on genetic perturbation. We release a reproducible pipeline wired to the official scorer that emits a valid submission over the real 1,064 x 12,995 grid.
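To make the ranking inversion concrete, the toy sketch below scores two hypothetical predictors with the same weighted-MSE formula under two weighting schemes: per-gene weights (standing in for an inverse-variance proxy) and per-(compound, gene) weights (standing in for an active-set metric). All numbers and weights are invented for illustration; this is not the official VCPI scorer and does not reproduce the contest's weights.

```python
import numpy as np

def weighted_mse(y_true, y_pred, weights):
    """Weighted MSE; `weights` is broadcast against the (compounds, genes) matrix."""
    w = np.broadcast_to(weights, y_true.shape)
    return float((w * (y_true - y_pred) ** 2).sum() / w.sum())

# Toy ground truth: 2 compounds x 2 genes; only (compound 0, gene 1) responds.
y_true = np.array([[0.0, 2.0],
                   [0.0, 0.0]])

# Model A predicts "no change" everywhere: exact on gene 0, misses the active entry.
pred_a = np.zeros_like(y_true)
# Model B captures the active entry but is offset on the quiet gene 0.
pred_b = np.array([[1.0, 2.0],
                   [1.0, 0.0]])

# Hypothetical per-gene proxy weights: the low-variance gene 0 dominates.
gene_w = np.array([10.0, 1.0])
# Hypothetical per-(compound, gene) weights: the active entry dominates.
active_w = np.array([[1.0, 10.0],
                     [1.0, 1.0]])

proxy_scores = {"A": weighted_mse(y_true, pred_a, gene_w),
                "B": weighted_mse(y_true, pred_b, gene_w)}
active_scores = {"A": weighted_mse(y_true, pred_a, active_w),
                 "B": weighted_mse(y_true, pred_b, active_w)}

# Lower is better, so the winner is the key with the smallest score.
proxy_winner = min(proxy_scores, key=proxy_scores.get)
active_winner = min(active_scores, key=active_scores.get)
```

With these invented weights the proxy prefers model A while the active-set weighting prefers model B, which mirrors in miniature how a metric that emphasises quiet genes can reward a predictor that misses the compound-specific signal.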

Problem

Research questions and friction points this paper is trying to address.

drug-response prediction

unseen chemistry

evaluation metric

model ranking

transcriptome response

Innovation

Methods, ideas, or system contributions that make the work stand out.

metric sensitivity

drug-response prediction

out-of-distribution generalization