Viral Proteins Reveal Geometry of Protein Language Models

📅 2026-06-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study investigates how protein language models (specifically the ESM series) represent viral protein sequences, which are underrepresented in training data. By analyzing the geometric structure of the embedding space, the authors identify a dominant “nativeness axis” aligned with masked reconstruction perplexity, which orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Linear separability tests show that, despite this global ordering, viral proteins remain linearly separable beyond what zero-shot perplexity and shallow sequence features explain. The findings indicate that the models retain discriminative signal even for data-scarce viral proteins, and that scaling contracts the nativeness axis unevenly across viral families.
๐Ÿ“ Abstract
Protein language models are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this, protein language model embeddings retain viral-specific signal: viral proteins remain linearly separable beyond zero-shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.
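To make the two quantities in the abstract concrete, here is a minimal sketch of masked reconstruction (pseudo-)perplexity and of a dominant axis in embedding space. It is not the authors' pipeline: the function names (`pseudo_perplexity`, `dominant_axis`, `project_onto`, `toy_demo`) are hypothetical, the dominant axis is assumed here to be the first principal component, and the data are synthetic stand-ins for per-residue log-probabilities and pooled embeddings that would in practice come from an ESM model.

```python
import numpy as np


def pseudo_perplexity(true_token_log_probs):
    """exp(mean negative log-probability of the true residue at each masked position).

    Assumes the log-probabilities were obtained by masking one position at a
    time and reading the model's probability for the original residue.
    """
    log_probs = np.asarray(true_token_log_probs, dtype=float)
    return float(np.exp(-log_probs.mean()))


def dominant_axis(embeddings):
    """Unit vector of largest variance (first principal component)."""
    x = np.asarray(embeddings, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def project_onto(embeddings, axis):
    """Scalar coordinate of each mean-centered embedding along `axis`."""
    x = np.asarray(embeddings, dtype=float)
    return (x - x.mean(axis=0)) @ axis


def toy_demo(seed=0, n=200, dim=16):
    """Synthetic check: embeddings built to vary with log-perplexity along one
    hidden direction should yield a dominant axis that tracks log-perplexity."""
    rng = np.random.default_rng(seed)
    log_ppl = rng.uniform(0.0, 3.0, size=n)  # toy log pseudo-perplexities
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    noise = 0.1 * rng.normal(size=(n, dim))
    embeddings = 5.0 * np.outer(log_ppl, direction) + noise
    scores = project_onto(embeddings, dominant_axis(embeddings))
    # The sign of a principal component is arbitrary, so compare magnitudes.
    return abs(float(np.corrcoef(scores, log_ppl)[0, 1]))
```

With real data, `toy_demo` would be replaced by per-sequence pseudo-perplexities and pooled embeddings, and the correlation between the projection and log pseudo-perplexity would indicate how closely the dominant axis follows the model's own notion of how well a sequence is modeled.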
Problem

Research questions and friction points this paper is trying to address.

protein language models
viral proteins
representation
data imbalance
nativeness
Innovation

Methods, ideas, or system contributions that make the work stand out.

protein language models
viral proteins
embedding geometry
nativeness axis
masked reconstruction perplexity