🤖 AI Summary
Protein representation learning (PRL) lacks a systematic framework and reproducible evaluation protocols. Method: We propose the first five-dimensional taxonomy (feature-based, sequence-based, structure-based, multimodal, and complex-based approaches) to unify over 120 models and 30+ benchmark datasets. We construct a cross-modal evaluation resource map that spans deep learning, geometric deep learning, graph neural networks, self-supervised pretraining, and multi-source data alignment for joint modeling of sequences, 3D structures, and functional annotations. We also identify interpretability, generalizability, and computational efficiency as three core challenges. Contribution/Results: This work establishes a standardized classification paradigm and reproducible evaluation guideline for PRL, enabling rigorous model comparison and fostering deeper synergy between algorithmic innovation and downstream applications in molecular biology and drug discovery.
📝 Abstract
Proteins are complex biomolecules that play a central role in various biological processes, making them critical targets for breakthroughs in molecular biology, medical research, and drug discovery. Deciphering their intricate, hierarchical structures and diverse functions is essential for advancing our understanding of life at the molecular level. Protein Representation Learning (PRL) has emerged as a transformative approach, enabling the extraction of meaningful computational representations from protein data to address these challenges. In this paper, we provide a comprehensive review of PRL research, categorizing methodologies into five key areas: feature-based, sequence-based, structure-based, multimodal, and complex-based approaches. To support researchers in this rapidly evolving field, we introduce widely used databases for protein sequences, structures, and functions, which serve as essential resources for model development and evaluation. We also explore the diverse applications of these approaches in multiple domains, demonstrating their broad impact. Finally, we discuss pressing technical challenges and outline future directions to advance PRL, offering insights to inspire continued innovation in this foundational field.