🤖 AI Summary
In protein engineering, interdisciplinary adoption of pretrained protein language models (PLMs) remains limited by challenges in data collection, a lack of standardized task benchmarks, and a high barrier to practical application. To address these challenges, we introduce VenusFactory, a unified, open-source platform designed for protein engineering. Its engine combines three core capabilities: biological data retrieval, standardized task benchmarking, and modular PLM fine-tuning, all accessible through either a command-line interface or a no-code Gradio UI. The platform integrates more than 40 protein-related datasets and more than 40 PLMs. VenusFactory aims to lower the technical barrier for biologists who want to use PLMs while also serving computer science researchers working on protein-related tasks. All implementations are open-sourced.
📝 Abstract
Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre-trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited due to challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory supports both the computer science and biology communities by offering command-line execution as well as a Gradio-based no-code interface, and it integrates $40+$ protein-related datasets and $40+$ popular PLMs. All implementations are open-sourced at https://github.com/tyang816/VenusFactory.