Predicting O-GlcNAcylation Sites in Mammalian Proteins with Transformers and RNNs Trained with a New Loss Function

📅 2024-02-27

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Published models for predicting O-GlcNAcylation sites in mammalian proteins have generalized poorly. This work first evaluates Transformer encoders on a large dataset and finds that, although they perform well, they do not match the previously published RNN (F1 = 36.17%, MCC = 34.57%). The authors then introduce a new loss function, the weighted focal differentiable Matthews Correlation Coefficient (MCC), which can be used both to train classification models and to fine-tune already trained ones. RNNs trained with this loss outperform RNNs trained with weighted cross-entropy, and a two-cell RNN trained with it reaches state-of-the-art performance on the same dataset with F1 = 38.88% and MCC = 38.20%.

📝 Abstract

Glycosylation, a protein modification, has multiple essential functional and structural roles. O-GlcNAcylation, a subtype of glycosylation, has the potential to be an important target for therapeutics, but methods to reliably predict O-GlcNAcylation sites had not been available until 2023; a 2021 review correctly noted that published models were insufficient and failed to generalize. Moreover, many are no longer usable. In 2023, a considerably better RNN model with an F$_1$ score of 36.17% and an MCC of 34.57% on a large dataset was published. This article first sought to improve these metrics using transformer encoders. While transformers displayed high performance on this dataset, their performance was inferior to that of the previously published RNN. We then created a new loss function, which we call the weighted focal differentiable MCC, to improve the performance of classification models. RNN models trained with this new function display superior performance to models trained using the weighted cross-entropy loss; this new function can also be used to fine-tune trained models. A two-cell RNN trained with this loss achieves state-of-the-art performance in O-GlcNAcylation site prediction with an F$_1$ score of 38.88% and an MCC of 38.20% on that large dataset.
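For reference, both metrics reported in the abstract can be computed from confusion-matrix counts. The sketch below uses the standard definitions of F1 and MCC; the function name `f1_and_mcc` and the example counts are illustrative assumptions, not data or code from the paper.

```python
import math
from typing import Tuple


def f1_and_mcc(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Return (F1, MCC) from confusion-matrix counts (standard definitions)."""
    # F1 is the harmonic mean of precision and recall; it ignores true negatives.
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    # MCC uses all four cells, so it stays informative when positives are rare.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return f1, mcc


# Hypothetical counts for a rare-positive task (not results from the paper).
f1, mcc = f1_and_mcc(tp=40, fp=60, fn=60, tn=9840)
print(f"F1 = {f1:.4f}, MCC = {mcc:.4f}")
```

Because true negatives dominate in site-prediction data, accuracy alone would look high even for a weak model, which is presumably why the abstract reports F1 and MCC instead.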

Problem

Research questions and friction points this paper is trying to address.

Predicting O-GlcNAcylation sites in mammalian proteins

Developing improved models using a new loss function

Achieving state-of-the-art performance in site prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer and RNN models

Weighted focal differentiable MCC loss

Fine-tuning trained models
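The idea behind a differentiable MCC loss can be sketched as follows: replace hard confusion-matrix counts with sums of predicted probabilities, so that MCC becomes a smooth function of the model outputs. This summary does not give the paper's exact formulation, so the class weight (`pos_weight`), the focal factor (`gamma`), and the way they are combined below are assumptions chosen for illustration, and `soft_mcc_loss` is a hypothetical name. The sketch uses plain Python; in practice the same arithmetic would be written with an autograd framework's tensors.

```python
import math
from typing import Sequence


def soft_mcc_loss(
    probs: Sequence[float],
    labels: Sequence[int],
    pos_weight: float = 1.0,  # assumed: extra weight on positive examples
    gamma: float = 0.0,       # assumed: focal exponent; 0 disables it
    eps: float = 1e-8,
) -> float:
    """Return 1 - MCC, with MCC computed on a soft confusion matrix."""
    tp = fp = fn = tn = 0.0
    for p, y in zip(probs, labels):
        # Probability assigned to the true class of this example.
        p_correct = p if y == 1 else 1.0 - p
        # Assumed focal factor: shrinks the contribution of easy examples.
        weight = (1.0 - p_correct) ** gamma
        if y == 1:
            weight *= pos_weight
            tp += weight * p
            fn += weight * (1.0 - p)
        else:
            fp += weight * p
            tn += weight * (1.0 - p)
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps
    return 1.0 - numerator / denominator


labels = [1, 1, 0, 0]
print(soft_mcc_loss([0.9, 0.6, 0.2, 0.4], labels))             # plain soft MCC
print(soft_mcc_loss([0.9, 0.6, 0.2, 0.4], labels, 2.0, 2.0))   # weighted, focal
```

With `gamma = 0` and `pos_weight = 1` this reduces to a plain soft-MCC loss: near 0 for perfect predictions, 1 for uninformative ones, and near 2 for fully inverted ones. One limitation of this particular focal weighting is that examples predicted with complete confidence receive zero weight when `gamma > 0`, so the sketch is best read as an illustration of the mechanism rather than a tuned loss.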

🔎 Similar Papers

GOProteinGNN: Leveraging Protein Knowledge Graphs for Protein Representation Learning