🤖 AI Summary
Predicting protein properties such as binding affinity and thermostability from sparse experimental data remains challenging. This work introduces a class of sequence kernels for Gaussian process models that combine evolutionary substitution matrices with a local linearity assumption. By learning what are in effect structure-aware substitution matrices, the kernels can also incorporate structural information from foundation models. The resulting Gaussian processes are data-efficient and frequently outperform alternatives built on foundation model embeddings. The structure-conditioned kernels are well suited to multi-task learning across multiple protein property landscapes, where they can decisively outperform local supervised learning methods.
📝 Abstract
Despite its importance to applications in protein design, predicting protein properties like binding affinity and thermostability from sparse experimental data remains a significant challenge. Accordingly, we introduce a class of sequence kernels that exploit evolutionary substitution matrices as well as local linearity and demonstrate that the resulting Gaussian processes provide data-efficient models of protein property landscapes, frequently outperforming alternatives that rely on foundation model embeddings. Furthermore, by learning what are in effect structure-aware substitution matrices, we show that our kernels can readily incorporate structural information from foundation models. We demonstrate that these structure-conditioned kernels are well suited to multi-task learning across multiple protein property landscapes and can decisively outperform local supervised learning methods.
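To make the idea of a substitution-matrix sequence kernel concrete, the following is a minimal sketch and not the authors' kernel. It assumes a toy four-letter alphabet, a hypothetical positive semidefinite substitution matrix `S`, and a simple additive form k(x, x') = Σ_i S[x_i, x'_i], which is linear in per-site features. The names `site_kernel` and `gp_posterior_mean`, the noise level, and all numeric values are illustrative assumptions.

```python
import numpy as np

# Toy alphabet (assumption; real proteins use 20 amino acids).
ALPHABET = "ACDE"
IDX = {a: i for i, a in enumerate(ALPHABET)}

# Hypothetical substitution matrix, built as F @ F.T so it is positive
# semidefinite. A/C are treated as similar, as are D/E. Values are illustrative.
F = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.1, 0.9]])
S = F @ F.T


def encode(seqs):
    """Map equal-length sequences to an integer array of shape (n, L)."""
    return np.array([[IDX[a] for a in s] for s in seqs])


def site_kernel(X1, X2):
    """Additive kernel: k(x, x') = sum over positions i of S[x_i, x'_i]."""
    return S[X1[:, None, :], X2[None, :, :]].sum(axis=-1)


def gp_posterior_mean(X_train, y_train, X_test, noise=1e-2):
    """Standard zero-mean GP regression posterior mean with Gaussian noise."""
    K = site_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    alpha = np.linalg.solve(K, y_train)
    return site_kernel(X_test, X_train) @ alpha


# Hypothetical training data: sequences with made-up property values.
X_train = encode(["ACDE", "CCDE", "ADDE", "ACEE"])
y_train = np.array([1.0, 0.9, 0.2, 0.8])
prediction = gp_posterior_mean(X_train, y_train, encode(["CCEE"]))
```

Because `S` is positive semidefinite, the additive kernel is a valid covariance function. Replacing the fixed `S` with a learned, position-dependent matrix would be one way to approximate the structure-aware variant the abstract describes, though the paper's actual parameterization is not given here.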