VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning

📅 2025-03-19
🤖 AI Summary
Pretrained protein language models (PLMs) face significant barriers to adoption in protein engineering, including scarce labeled data, a lack of standardized benchmarks, and a high entry threshold for interdisciplinary users. To address these challenges, the authors introduce VenusFactory, a unified, open-source platform for protein engineering. It combines three core capabilities in one engine: biological data retrieval, standardized task benchmarking, and modular PLM fine-tuning, accessible through either a command-line interface or a no-code Gradio UI. The platform integrates more than 40 protein-related datasets and more than 40 PLMs within a modular, extensible architecture. VenusFactory aims to lower the technical barrier for biologists who want to use PLMs while improving reproducibility and scalability for computational researchers working on protein-related tasks. All implementations are open-sourced to support community-driven collaboration and continued development.

📝 Abstract
Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre-trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited by challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory serves both the computer science and biology communities by offering a choice between command-line execution and a Gradio-based no-code interface, and it integrates 40+ protein-related datasets and 40+ popular PLMs. All implementations are open-sourced at https://github.com/tyang816/VenusFactory.
Problem

Research questions and friction points this paper is trying to address.

Facilitates protein engineering data retrieval and analysis.
Standardizes task benchmarking for protein language models.
Enables modular fine-tuning of pre-trained protein models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates biological data retrieval and PLM fine-tuning
Supports command-line and no-code interface options
Integrates 40+ protein-related datasets and 40+ protein language models, with all implementations open-sourced
Yang Tan
Shanghai Jiao Tong University, China; Shanghai Artificial Intelligence Laboratory, China; East China University of Science and Technology, China
Chen Liu
East China University of Science and Technology, China
Jingyuan Gao
Shanghai Jiao Tong University, China; Shanghai Artificial Intelligence Laboratory, China; East China University of Science and Technology, China
Banghao Wu
Shanghai Jiao Tong University, China
Mingchen Li
Shanghai Jiao Tong University, China; Shanghai Artificial Intelligence Laboratory, China; East China University of Science and Technology, China
Ruilin Wang
East China University of Science and Technology, China
Lingrong Zhang
Shanghai Jiao Tong University, China
Huiqun Yu
East China University of Science and Technology, China
Guisheng Fan
East China University of Science and Technology, China
Liang Hong
Shanghai Jiao Tong University, China; Shanghai Artificial Intelligence Laboratory, China
Bingxin Zhou
Shanghai Jiao Tong University
Graph Neural Networks, Protein Representation Learning, AI4Biology