Exploring the limits of pre-trained embeddings in machine-guided protein design: a case study on predicting AAV vector viability

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses a core challenge in protein engineering: experimental datasets often contain only sparse or highly localized mutations, which hinders general sequence representations from capturing functional signals. Using adeno-associated virus (AAV) capsids as a case study, the authors systematically evaluate pretrained protein language models, such as ProtBERT and ESM2, for predicting AAV vector viability. They compare amino acid-level and sequence-level embeddings in both supervised and unsupervised settings and investigate the impact of task-specific fine-tuning. The results show that, without fine-tuning, amino acid-level embeddings perform better in supervised tasks, whereas sequence-level representations tend to be more effective in unsupervised settings; after fine-tuning on the target task, however, sequence-level representations substantially outperform the alternatives, achieving state-of-the-art results. This work underscores the critical role of fine-tuning in unlocking the full potential of pretrained models when working with localized mutational data.

📝 Abstract
Effective representations of protein sequences are widely recognized as a cornerstone of machine learning-based protein design. Yet, protein bioengineering poses unique challenges for sequence representation, as experimental datasets typically feature few mutations, which are either sparsely distributed across the entire sequence or densely concentrated within localized regions. This limits the ability of sequence-level representations to extract functionally meaningful signals. In addition, comprehensive comparative studies remain scarce, despite their crucial role in clarifying which representations best encode relevant information and ultimately support superior predictive performance. In this study, we systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus capsid as a case study and prototypical example of bioengineering, where functional optimization is targeted through highly localized sequence variation within an otherwise large protein. Our results reveal that, prior to fine-tuning, amino acid-level embeddings outperform sequence-level representations in supervised predictive tasks, whereas the latter tend to be more effective in unsupervised settings. However, optimal performance is only achieved when embeddings are fine-tuned with task-specific labels, with sequence-level representations providing the best performance. Moreover, our findings indicate that the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies, showing the need for fine-tuning in datasets characterized by sparse or highly localized mutations.
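The distinction the abstract draws between amino acid-level and sequence-level representations can be illustrated with a minimal sketch. Here a random matrix stands in for the per-residue embeddings a pretrained model such as ProtBERT or ESM2 would produce, and mean-pooling collapses them into a single sequence-level vector; the peptide string and embedding dimension are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8  # illustrative; real protein language models use hundreds to thousands of dimensions

def embed_per_residue(sequence: str) -> np.ndarray:
    """Toy stand-in for a pretrained protein language model:
    returns an (L, D) matrix with one D-dimensional embedding per residue.
    Random here, purely to show the shapes being compared."""
    return rng.standard_normal((len(sequence), EMBED_DIM))

peptide = "DEEEIRTTNPVATEQYGSVSTNLQRGNR"  # illustrative 28-residue sequence

aa_level = embed_per_residue(peptide)  # amino acid-level representation: (L, D)
seq_level = aa_level.mean(axis=0)      # sequence-level representation: mean-pooled to (D,)

print(aa_level.shape)   # (28, 8)
print(seq_level.shape)  # (8,)
```

A localized mutation changes only a few rows of the amino acid-level matrix, but after mean-pooling over all L residues its effect on the sequence-level vector is diluted, which is consistent with the paper's observation that large sequence variation, or task-specific fine-tuning, is needed to shift sequence-level representations.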
Problem

Research questions and friction points this paper is trying to address.

protein design
pre-trained embeddings
sequence representation
AAV vector
localized mutations
Innovation

Methods, ideas, or system contributions that make the work stand out.

protein embedding
fine-tuning
AAV capsid design
sequence representation
machine-guided protein design
Ana F. Rodrigues
LASIGE, Faculdade de Ciências da Universidade de Lisboa, Lisboa, Portugal
Lucas Ferraz
LASIGE, Faculdade de Ciências da Universidade de Lisboa, Lisboa, Portugal
Laura Balbi
LASIGE, Faculdade de Ciências da Universidade de Lisboa, Lisboa, Portugal
Pedro Giesteira Cotovio
LASIGE, Faculdade de Ciências da Universidade de Lisboa, Lisboa, Portugal
Catia Pesquita
LASIGE, Informática, Faculdade de Ciências, Universidade de Lisboa, Portugal
AI for Science
Knowledge Graphs
Bioinformatics
Ontology Matching