HonestAffinity: Leak-Aware Evaluation of Protein and Pocket Priors for Binding Affinity Prediction

📅 2026-06-02
🤖 AI Summary
This study addresses data leakage in protein-ligand binding affinity prediction, where canonical PDBbind-style splits share similarity classes across folds and so inflate apparent generalization. The authors combine a leak-aware evaluation protocol with paired ablations on LP-PDBBind to isolate two priors: frozen ESM-2 (650M) protein embeddings and a learned binary pocket-position marker. On the strict no-leak tiers (test_cl1-cl3), the variant that keeps the pocket marker but drops ESM-2 (HonestAffinity-Pocket-NoESM) achieves the best mean Pearson R, whereas the variant with both priors (HonestAffinity-Pocket) is best on the validation and CASF-2016 splits. The results show that the evaluation protocol can reverse which prior appears best, and the authors argue for reporting paired canonical and leak-proof ablations rather than a single default configuration.
📝 Abstract
Sequence-based deep learning offers a scalable alternative to structure-based scoring for protein-ligand binding affinity prediction. However, progress is hard to interpret when architectural priors are evaluated on canonical PDBbind-style splits that leak similarity classes across folds. We present HonestAffinity, a compact 1D-input predictor to isolate two priors under a leak-aware protocol: frozen ESM-2 (650M) protein embeddings and a learned binary pocket-position marker. We evaluate a multi-scale convolutional/Transformer template in three variants: HonestAffinity-Pocket, HonestAffinity-NoPocket, and HonestAffinity-Pocket-NoESM. All three train on 11,513 LP-PDBBind complexes in ~3 GPU-hours. We benchmark against five baselines on the LP-PDBBind 3-tier no-leak hold-out, CASF-2016, and a CASF-2016 non-train subset. Our central finding is a split-conditioned reversal rather than a uniformly best prior: HonestAffinity-Pocket achieves the best mean Pearson R on validation and CASF-2016 splits, whereas HonestAffinity-Pocket-NoESM achieves the best mean Pearson R on every strict LP no-leak tier (test_cl1-cl3). Both the pocket marker and ESM-2 input improve performance on familiar splits but reduce Pearson R on strict no-leak tiers. We argue models should report paired canonical and leak-proof ablations, and that deployment-regime-matched variants better describe these reversals than a single default. Code and scripts are linked in the footnote; checkpoints will be released upon acceptance.
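The paired reporting the abstract argues for can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the split names, variant names, and all numbers are hypothetical toy values chosen only to show a split-conditioned reversal, and `pearson_r`, `paired_report`, and `best_variant_per_split` are helper names introduced here.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation between measured and predicted affinities."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("Pearson R is undefined for a constant sequence")
    return cov / math.sqrt(var_t * var_p)


def paired_report(labels, predictions):
    """Score every variant on every split.

    labels:      {split_name: [measured affinity, ...]}
    predictions: {variant_name: {split_name: [predicted affinity, ...]}}
    returns:     {split_name: {variant_name: Pearson R}}
    """
    return {
        split: {
            variant: pearson_r(y_true, per_split[split])
            for variant, per_split in predictions.items()
        }
        for split, y_true in labels.items()
    }


def best_variant_per_split(report):
    """Name the top-scoring variant separately for each split."""
    return {split: max(scores, key=scores.get) for split, scores in report.items()}


# Toy values only (NOT results from the paper): one canonical split and
# one no-leak split, with two hypothetical variants.
labels = {
    "canonical": [1.0, 2.0, 3.0, 4.0],
    "no_leak": [1.0, 2.0, 3.0, 4.0],
}
predictions = {
    "variant_a": {"canonical": [1.0, 2.0, 3.0, 4.0], "no_leak": [1.0, 3.0, 2.0, 4.0]},
    "variant_b": {"canonical": [1.0, 3.0, 2.0, 4.0], "no_leak": [1.0, 2.0, 3.0, 4.0]},
}

report = paired_report(labels, predictions)
best = best_variant_per_split(report)

for split, scores in report.items():
    row = ", ".join(f"{name}: R={r:.2f}" for name, r in scores.items())
    print(f"{split:>10} | {row} | best: {best[split]}")
```

Reporting the winner per split, rather than one averaged score, is what makes a reversal visible: in the toy data, `variant_a` wins on the canonical split while `variant_b` wins on the no-leak split.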
Problem

Research questions and friction points this paper is trying to address.

binding affinity prediction
data leakage
protein-ligand interaction
evaluation protocol
model prior
Innovation

Methods, ideas, or system contributions that make the work stand out.

leak-aware evaluation
binding affinity prediction
protein language model
pocket prior
sequence-based deep learning
Junhao Wei
Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR, China
Baili Lu
Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR, China; College of Animal Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, China
Zhenhong Peng
College of Animal Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, China
Wanyan Li
College of Animal Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, China
Zhirong Huang
SLAC and Stanford University
Accelerator Physics, Free Electron Lasers
Yanxiao Li
National Energy Technology Laboratory
Yifu Zhao
Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR, China
Dexing Yao
Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR, China
Haochen Li
Tsinghua University
Cell-cell communication, single-cell genomics, spatial transcriptomics
Xudong Ye
Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR, China
Sio-Kei Im
Macao Polytechnic University, Macao SAR, China
Yapeng Wang
Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR, China
Xu Yang
Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR, China