🤖 AI Summary
This work addresses the challenge of adeno-associated virus (AAV) capsid design for gene therapy, which is hindered by the vast sequence space and limited experimental screening capacity. The authors propose an approach that combines protein language models with reinforcement learning: a pretrained model is fine-tuned on experimentally validated capsid sequences, and a reward function then guides sequence generation. The reward jointly promotes predicted viability and sequence novelty, which lets generation move beyond the regions covered by the training data. In comparative analyses, fine-tuning alone yields sequences with high predicted viability that remain close to the training distribution, whereas reinforcement learning-guided generation reaches more distant regions of sequence space while maintaining high predicted viability. The authors also propose a candidate selection strategy that integrates predicted viability, sequence novelty, and biophysical properties to prioritize AAV capsid variants for downstream evaluation.
📝 Abstract
Adeno-associated viral (AAV) vectors are widely used delivery platforms in gene therapy, and the design of improved capsids is key to expanding their therapeutic potential. A central challenge in AAV bioengineering, as in protein design more broadly, is the vast sequence design space relative to the scale of feasible experimental screening. Machine-guided generative approaches provide a powerful means of navigating this landscape and proposing novel protein sequences that satisfy functional constraints. Here, we develop a generative design framework based on protein language models and reinforcement learning to generate highly novel yet functionally plausible AAV capsids. A pretrained model was fine-tuned on experimentally validated capsid sequences to learn patterns associated with viability. Reinforcement learning was then used to guide sequence generation, with a reward function that jointly promoted predicted viability and sequence novelty, thereby enabling exploration beyond regions represented in the training data. Comparative analyses showed that fine-tuning alone produces sequences with high predicted viability but remains biased toward the training distribution, whereas reinforcement learning-guided generation reaches more distant regions of sequence space while maintaining high predicted viability. Finally, we propose a candidate selection strategy that integrates predicted viability, sequence novelty, and biophysical properties to prioritize variants for downstream evaluation. This work establishes a framework for the generative exploration of protein sequence space and advances the application of generative protein language models to AAV bioengineering.
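The abstract does not give the form of the reward function, so the following is only an illustrative sketch of how a reward might jointly score predicted viability and sequence novelty. Everything in it is an assumption made for illustration: the function names, the weighted-sum combination, the `novelty_weight` parameter, the use of normalized minimum Hamming distance as the novelty measure, and the caller-supplied `viability_fn` standing in for a trained viability predictor. It is not the authors' implementation.

```python
from typing import Callable, Sequence


def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def novelty(seq: str, references: Sequence[str]) -> float:
    """Assumed novelty measure: minimum Hamming distance to any reference
    sequence, divided by sequence length, giving a value in [0, 1]."""
    return min(hamming_distance(seq, ref) for ref in references) / len(seq)


def reward(
    seq: str,
    references: Sequence[str],
    viability_fn: Callable[[str], float],
    novelty_weight: float = 0.5,
) -> float:
    """Assumed reward: weighted sum of predicted viability and novelty.

    viability_fn is a hypothetical placeholder for a learned predictor
    that returns a score in [0, 1].
    """
    v = viability_fn(seq)
    n = novelty(seq, references)
    return (1.0 - novelty_weight) * v + novelty_weight * n


def rank_candidates(
    candidates: Sequence[str],
    references: Sequence[str],
    viability_fn: Callable[[str], float],
    novelty_weight: float = 0.5,
) -> list:
    """Sort candidates from highest to lowest reward."""
    return sorted(
        candidates,
        key=lambda s: reward(s, references, viability_fn, novelty_weight),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy data only; these short strings are not real capsid sequences.
    refs = ["ACDE", "ACDF"]
    always_viable = lambda s: 1.0  # hypothetical stand-in predictor
    print(rank_candidates(["ACDE", "AAAA"], refs, always_viable))
```

In this sketch, raising `novelty_weight` pushes the ranking toward sequences farther from the reference set, while lowering it favors predicted viability. A selection scheme that also accounts for biophysical properties, as the abstract describes, would need additional terms that are not specified here.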