🤖 AI Summary
To address the limited representational capacity of protein language models (PLMs) that arises from single-task pre-training, this work introduces a dual-task collaborative pre-training paradigm that relies solely on raw amino acid sequences. The proposed model, Ankh3, is built on the Transformer architecture and jointly optimizes two self-supervised objectives: masked language modeling with multiple masking probabilities, and structure-agnostic end-to-end sequence completion. It combines multi-granularity random masking with both autoregressive and non-autoregressive prediction heads, without using 3D structural information or any external supervision, substantially improving the generality and transferability of the learned representations. Empirically, Ankh3 achieves state-of-the-art performance across diverse downstream tasks, including secondary structure prediction, fluorescence intensity estimation, GB1 binding affinity prediction, and contact map inference, demonstrating a substantial improvement in learned protein representations.
📝 Abstract
Protein language models (PLMs) have emerged as powerful tools for detecting complex patterns in protein sequences. However, their ability to fully capture the information in protein sequences may be limited by a focus on a single pre-training task. Although adding data modalities or supervised objectives can improve PLM performance, pre-training often remains centered on denoising corrupted sequences. To push the boundaries of PLMs, we investigated a multi-task pre-training strategy. We developed Ankh3, a model jointly optimized on two objectives: masked language modeling with multiple masking probabilities, and protein sequence completion, relying only on protein sequences as input. This multi-task pre-training demonstrated that PLMs can learn richer and more generalizable representations solely from protein sequences. The results showed improved performance on downstream tasks such as secondary structure prediction, fluorescence prediction, GB1 fitness prediction, and contact prediction. Integrating multiple tasks gave the model a more comprehensive understanding of protein properties, leading to more robust and accurate predictions.
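To make the two pre-training objectives concrete, the sketch below builds a toy dual-task batch from raw amino acid sequences: each sequence yields a masked-language-modeling example whose masking probability is sampled from a small set, plus a prefix/suffix pair for the sequence-completion objective. This is an illustrative sketch only, not the authors' implementation; the probability set `(0.15, 0.30, 0.50)`, the 50/50 completion split, and the `<mask>` token are assumptions for demonstration.

```python
import random

MASK = "<mask>"  # placeholder mask token; the actual tokenizer's token may differ

def mask_sequence(seq, mask_prob, rng):
    """Randomly mask residues with the given probability (MLM objective).
    Returns the corrupted token list and the (position, residue) targets."""
    corrupted, targets = [], []
    for i, aa in enumerate(seq):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append((i, aa))
        else:
            corrupted.append(aa)
    return corrupted, targets

def completion_example(seq, prefix_frac):
    """Split a sequence into a prefix (model input) and a suffix
    (target) for the sequence-completion objective."""
    cut = max(1, int(len(seq) * prefix_frac))
    return seq[:cut], seq[cut:]

def make_pretraining_batch(sequences, mask_probs=(0.15, 0.30, 0.50), seed=0):
    """Build a toy dual-task batch. For each sequence, sample one masking
    probability from `mask_probs` (illustrative values, not the paper's)
    and also create a completion example from the same sequence."""
    rng = random.Random(seed)
    batch = []
    for seq in sequences:
        p = rng.choice(mask_probs)
        corrupted, targets = mask_sequence(seq, p, rng)
        prefix, suffix = completion_example(seq, prefix_frac=0.5)
        batch.append({
            "mlm_input": corrupted,
            "mlm_targets": targets,
            "completion_input": prefix,
            "completion_target": suffix,
            "mask_prob": p,
        })
    return batch
```

In an actual training loop, the non-autoregressive head would predict the residues at the masked positions of `mlm_input`, while the autoregressive head would generate `completion_target` token by token given `completion_input`; both losses would be optimized jointly.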