Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language models struggle to predict gene expression changes under specific cellular perturbations, often mistaking a gene's intrinsic response tendency for a genuine perturbation effect. To address this, the work proposes CORE (Contrastive Organization of Relational Evidence), a framework that reframes perturbation prediction as a contrastive comparison task. CORE uses a biomedical knowledge graph to retrieve positive and negative outcomes for the same gene under related perturbations, which helps the model reason about perturbation-specific responses. The framework supports two paradigms, CORE-Reasoning and CORE-Voting, and reports substantial improvements: on drug-perturbation data, CORE-Reasoning improves the aggregate metrics of Qwen3.5-9B by up to 28.6%; on generic perturbation data, CORE-Voting raises macro-per-gene AUROC from chance level to 0.703 on average across four cell lines. These results underscore the role of contrastively organized relational evidence in improving both prediction accuracy and calibration.

📝 Abstract

Perturbation experiments are central to understanding cellular mechanisms, but they remain costly and sparse, motivating the prediction of gene expression responses for unobserved conditions. A promising recent direction uses large language models (LLMs) as "virtual cell" simulators, applying stepwise, knowledge-grounded mechanistic reasoning to infer differential expression, and points toward an interpretable, knowledge-driven paradigm that goes beyond purely data-driven approaches. However, we find that plausibility is not prediction. Despite producing biologically plausible explanations, these methods fail to capture perturbation-specific effects: they systematically overestimate differential expression, often underperform a simple gene-frequency baseline in aggregate evaluations, and collapse to chance-level performance at the per-gene level. This reveals a reliance on intrinsic gene response tendencies rather than true perturbation reasoning. We trace this failure to how evidence is presented: existing methods evaluate perturbation-gene pairs in isolation, without exposing how related perturbations differ in their effects on the same gene. To address this limitation, we introduce CORE (Contrastive Organization of Relational Evidence), which reframes prediction as a comparison task by organizing evidence into positive and negative outcomes from related perturbations. Using a biomedical knowledge graph for evidence retrieval, CORE improves calibration and substantially boosts perturbation-specific prediction in both LLM-based and non-LLM settings. For example, on drug-perturbation data, CORE-Reasoning improves Qwen3.5-9B aggregate metrics by up to 28.6%, while on generic perturbation data, CORE-Voting raises macro-per-gene AUROC from chance to 0.703 on average across four cell lines. This highlights contrastive evidence organization as essential to reliable LLM-based perturbation reasoning.
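The gap between aggregate and per-gene evaluation described in the abstract can be illustrated with a small sketch. It is not the paper's evaluation code: the toy records, the helper names (`auroc`, `gene_frequency_scores`, `macro_per_gene_auroc`), and the reading of "macro-per-gene AUROC" as the unweighted mean of per-gene AUROCs are all assumptions made for illustration. A gene-frequency baseline scores every perturbation-gene pair by how often that gene is differentially expressed, so it carries no perturbation-specific information.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. Returns None if only one class is present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gene_frequency_scores(records):
    """Hypothetical baseline: score each pair by the gene's overall DE rate.

    Computed in-sample here to keep the toy short; a real baseline would
    estimate the rate on held-out training perturbations.
    """
    by_gene = defaultdict(list)
    for _, gene, label in records:
        by_gene[gene].append(label)
    freq = {gene: sum(v) / len(v) for gene, v in by_gene.items()}
    return [freq[gene] for _, gene, _ in records]


def macro_per_gene_auroc(records, scores):
    """Assumed definition: unweighted mean of AUROCs computed within each gene."""
    grouped = defaultdict(lambda: ([], []))
    for (_, gene, label), score in zip(records, scores):
        grouped[gene][0].append(label)
        grouped[gene][1].append(score)
    values = [auroc(labels, s) for labels, s in grouped.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


# Toy data (made up): (perturbation, gene, is_differentially_expressed)
records = [
    ("pertA", "G1", 1), ("pertB", "G1", 1), ("pertC", "G1", 1), ("pertD", "G1", 0),
    ("pertA", "G2", 0), ("pertB", "G2", 0), ("pertC", "G2", 0), ("pertD", "G2", 1),
]

scores = gene_frequency_scores(records)
aggregate = auroc([label for _, _, label in records], scores)
macro = macro_per_gene_auroc(records, scores)
print(aggregate, macro)  # 0.75 0.5
```

On this toy data the baseline reaches an aggregate AUROC of 0.75 simply by knowing that G1 responds more often than G2, yet within each gene its score is constant, so the per-gene AUROC sits at 0.5. This is the pattern the abstract attributes to methods that rely on intrinsic gene response tendencies rather than perturbation-specific reasoning.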

Problem

Research questions and friction points this paper is trying to address.

cellular perturbation

large language models

differential expression

contrastive evidence

gene expression prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

contrastive reasoning

cellular perturbation prediction

large language models