Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction

📅 2025-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Sequence-based prediction of protein–protein interaction (PPI) binding affinity with protein language models (PLMs) has been held back by two factors: a shortage of rigorously curated datasets (annotation inconsistencies, duplicate entries, and homology between data splits) and reliance on simple concatenation to combine protein representations. This work contributes: (1) a curated version of the PPB-Affinity dataset with 8,207 unique PPI entries, split into training, validation, and test sets under a ≤30% sequence identity threshold to minimize data leakage; (2) a comparison of four architectures for multi-chain inputs: embeddings concatenation (EC), sequences concatenation (SC), hierarchical pooling (HP), and pooled attention addition (PAD); and (3) a systematic evaluation across five PLMs (ProtT5, ESM2, Ankh, Ankh2, ESM3) under two training regimes, full fine-tuning and lightweight ConvBERT heads over frozen PLM features. HP and PAD consistently outperform the concatenation baselines, with gains of up to 12% in Spearman correlation.

📝 Abstract

Protein-protein interactions (PPIs) are fundamental to numerous cellular processes, and their characterization is vital for understanding disease mechanisms and guiding drug discovery. While protein language models (PLMs) have demonstrated remarkable success in predicting protein structure and function, their application to sequence-based PPI binding affinity prediction remains relatively underexplored. This gap is often attributed to the scarcity of high-quality, rigorously refined datasets and the reliance on simple strategies for concatenating protein representations. In this work, we address these limitations. First, we introduce a meticulously curated version of the PPB-Affinity dataset comprising 8,207 unique protein-protein interaction entries, obtained by resolving annotation inconsistencies and duplicate entries for multi-chain protein interactions. The dataset applies a stringent sequence identity threshold of at most 30% when splitting into training, validation, and test sets, minimizing data leakage. Second, we propose and systematically evaluate four architectures for adapting PLMs to PPI binding affinity prediction: embeddings concatenation (EC), sequences concatenation (SC), hierarchical pooling (HP), and pooled attention addition (PAD). These architectures were assessed using two training methods: full fine-tuning and a lightweight approach employing ConvBERT heads over frozen PLM features. Our comprehensive experiments across multiple leading PLMs (ProtT5, ESM2, Ankh, Ankh2, and ESM3) demonstrated that the HP and PAD architectures consistently outperform conventional concatenation methods, achieving up to a 12% increase in Spearman correlation. These results highlight the necessity of sophisticated architectural designs to fully exploit the capabilities of PLMs for nuanced PPI binding affinity prediction.
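
The abstract names the four architectures without defining them, so the following is only a rough sketch of the general difference between pooling everything at once and pooling hierarchically; it is not the paper's implementation of EC or HP. The lookup-table "encoder", the width `DIM`, the function names, and the toy sequences are all assumptions made for illustration.

```python
import random

DIM = 4  # hypothetical embedding width; real PLMs use far larger values
_rng = random.Random(0)
# Stand-in for a PLM encoder (assumption): one fixed random vector per residue letter.
RESIDUE_TABLE = {aa: [_rng.gauss(0, 1) for _ in range(DIM)]
                 for aa in "ACDEFGHIKLMNPQRSTVWY"}

def embed(seq):
    """Return one DIM-wide vector per residue, mimicking per-residue PLM output."""
    return [RESIDUE_TABLE[aa] for aa in seq]

def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def flat_pool(chains_a, chains_b):
    """Join the residue embeddings of every chain, then pool once."""
    residues = [vec for seq in chains_a + chains_b for vec in embed(seq)]
    return mean_pool(residues)

def hierarchical_pool(chains_a, chains_b):
    """Pool residues per chain, then chains per partner, then join the two partners."""
    def pool_partner(chains):
        return mean_pool([mean_pool(embed(seq)) for seq in chains])
    return pool_partner(chains_a) + pool_partner(chains_b)

receptor = ["A", "CCC"]  # hypothetical two-chain partner
ligand = ["DE"]          # hypothetical single-chain partner
flat = flat_pool(receptor, ligand)
hier = hierarchical_pool(receptor, ligand)
print(len(flat), len(hier))  # prints "4 8"
```

In the flat variant, longer chains contribute more residues and so dominate the pooled vector; in the hierarchical variant, each chain counts equally within its partner and the two partners keep separate slots in the output. A regression head for binding affinity would sit on top of either vector.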

Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of high-quality datasets for protein-protein interaction prediction

Evaluating advanced architectures for protein language models in binding affinity prediction

Improving prediction accuracy beyond simple concatenation methods for multi-chain PPIs
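
On the dataset point above, one common way to limit homology leakage is to cluster sequences at the chosen identity threshold and assign whole clusters to a single split. The sketch below assumes every entry already carries a cluster label produced by an external clustering tool; `cluster_split`, the entry IDs, and the 80/10/10 fractions are hypothetical and do not describe the paper's actual procedure, which this page does not spell out.

```python
import random
from collections import defaultdict

def cluster_split(entry_to_cluster, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/val/test so that no cluster spans two splits."""
    clusters = defaultdict(list)
    for entry, cluster in entry_to_cluster.items():
        clusters[cluster].append(entry)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)  # reproducible cluster order

    targets = [f * len(entry_to_cluster) for f in fractions]
    splits = {"train": [], "val": [], "test": []}
    names = list(splits)
    current = 0
    for cid in cluster_ids:
        # Advance to the next split once the current one has reached its target size;
        # the last split takes whatever remains.
        while current < len(names) - 1 and len(splits[names[current]]) >= targets[current]:
            current += 1
        splits[names[current]].extend(clusters[cid])
    return splits

# Hypothetical toy input: 20 entries grouped into 10 clusters of two.
toy = {f"entry{i}": f"cluster{i // 2}" for i in range(20)}
result = cluster_split(toy)
print({name: len(members) for name, members in result.items()})
```

Because assignment happens per cluster, split sizes only approximate the requested fractions. Multi-chain complexes add a complication this sketch ignores: a complex whose chains fall into different clusters needs a rule for which cluster label it inherits.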

Innovation

Methods, ideas, or system contributions that make the work stand out.

Curated PPB-Affinity dataset (8,207 unique entries) split under a ≤30% sequence identity threshold

Four PLM architectures: EC, SC, HP, PAD

HP and PAD outperform concatenation by up to 12% in Spearman correlation
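
The reported gains are measured in Spearman correlation, which compares the rank order of predicted and measured affinities rather than their raw values. As a minimal sketch of how that metric is computed (Pearson correlation applied to ranks, with tied values sharing their average rank), the helper names and the affinity values below are made up for illustration and are not taken from the paper.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))

measured = [-9.1, -7.4, -11.2, -8.0]   # hypothetical affinities
predicted = [-8.5, -7.9, -10.1, -7.0]  # hypothetical model outputs
print(round(spearman(measured, predicted), 3))  # prints 0.8
```

Because only ranks matter, a model can score well on this metric while being miscalibrated in absolute terms, which is why it is usually read alongside an error measure.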
