🤖 AI Summary
Existing methods for classifying DNA/RNA-binding proteins (DBPs/RBPs) suffer from high cross-prediction error rates and poor accuracy in identifying dual RNA/DNA-binding proteins (DRBPs). To address these limitations, we propose a novel multi-label learning framework that integrates the pre-trained protein language model ESM-2 with domain-specific attention mechanisms. Specifically, we introduce a label-aware attention module to enhance class-discriminative representations and a cross-label attention mechanism to explicitly model functional dependencies between DBP and RBP annotations. Sequence features are jointly extracted using CNNs and multi-head self-attention, followed by sigmoid-based multi-label prediction. On benchmark datasets, our method significantly reduces cross-prediction errors and substantially improves DRBP identification accuracy. Moreover, it offers strong interpretability: visualization analyses successfully localize key functional regions and reveal their correspondence with binding specificity.
📝 Abstract
Identifying DNA- (DBPs) and RNA-binding proteins (RBPs) is crucial for the understanding of cell function, molecular interactions as well as regulatory functions. Owing to their high similarity, most of the existing approaches face challenges in differentiating between DBPs and RBPs leading to high cross-prediction errors. Moreover, identifying proteins which bind to both DNA and RNA (DRBPs) is also quite a challenging task. In this regard, we propose a novel framework viz. LAMP-PRo which is based on pre-trained protein language model (PLM), attention mechanisms and multi-label learning to mitigate these issues. First, pre-trained PLM such ESM-2 is used for embedding the protein sequences followed by convolutional neural network (CNN). Subsequently multi-head self-attention mechanism is applied for the contextual information while label-aware attention is used to compute class-specific representations by attending to the sequence in a way that is tailored to each label (DBP, RBP and non-NABP) in a multi-label setup. We have also included a novel cross-label attention mechanism to explicitly capture dependencies between DNA- and RNA-binding proteins, enabling more accurate prediction of DRBP. Finally, a linear layer followed by a sigmoid function are used for the final prediction. Extensive experiments are carried out to compare LAMP-PRo with the existing methods wherein the proposed model shows consistent competent performance. Furthermore, we also provide visualization to showcase model interpretability, highlighting which parts of the sequence are most relevant for a predicted label. The original datasets are available at http://bliulab.net/iDRBP_MMC and the codes are available at https://github.com/NimishaGhosh/LAMP-PRo.