🤖 AI Summary
The two-dimensional HP lattice protein folding problem is an NP-hard combinatorial optimization task. This paper introduces the Masked Variational Annealing (MVA) framework, which models the sequence-to-structure mapping via a sparse recurrent neural network (RNN), integrates temperature-driven variational annealing sampling with an energy-guided dynamic masking mechanism, and explicitly excludes invalid conformations during autoregressive generation. We propose a novel upper-bound-guided masking training strategy that preserves RNN representational capacity while enhancing search efficiency and conformational feasibility. The method naturally generalizes to three-dimensional lattices and to models with larger alphabets. On the standard 60-residue benchmark set, MVA achieves, for the first time, exact prediction of all known optimal conformations, substantially outperforming conventional heuristic algorithms and state-of-the-art learning-based approaches.
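The masking mechanism can be illustrated with a minimal sketch. This is not the paper's implementation: here a uniform `policy` stands in for the RNN's conditional move distribution, and the mask simply zeroes out moves that would revisit an occupied lattice site before renormalizing, so every sampled fold is a valid self-avoiding walk.

```python
import numpy as np

# Unit moves on the 2D square lattice.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def sample_fold(length, policy=None, rng=None):
    """Autoregressively sample a self-avoiding walk of `length` beads.

    `policy(step, occupied, pos)` returns a dict of move -> probability;
    a uniform distribution stands in for the RNN's learned conditionals.
    Invalid moves are masked and the remainder renormalized.
    """
    rng = rng or np.random.default_rng(0)
    pos = (0, 0)
    occupied = {pos}
    path = [pos]
    for step in range(length - 1):
        probs = policy(step, occupied, pos) if policy else {m: 0.25 for m in MOVES}
        # Mask: drop any move that would land on an occupied site.
        valid = {m: p for m, p in probs.items()
                 if (pos[0] + MOVES[m][0], pos[1] + MOVES[m][1]) not in occupied}
        if not valid:
            return None  # walk trapped itself; caller may resample
        total = sum(valid.values())
        moves = list(valid)
        choice = rng.choice(moves, p=[valid[m] / total for m in moves])
        dx, dy = MOVES[choice]
        pos = (pos[0] + dx, pos[1] + dy)
        occupied.add(pos)
        path.append(pos)
    return path
```

Because masking happens inside each conditional step rather than by rejection after the fact, the chain-rule factorization of the sampler (and thus exact per-sample probabilities) is preserved, which is the property the paper's autoregressive scheme relies on.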
📝 Abstract
Understanding the principles of protein folding is a cornerstone of computational biology, with implications for drug design, bioengineering, and insight into fundamental biological processes. Lattice protein folding models offer a simplified yet powerful framework for studying the complexities of protein folding, enabling the exploration of energetically optimal folds under constrained conditions. However, finding these optimal folds is a computationally challenging combinatorial optimization problem. In this work, we introduce a novel upper-bound training scheme that employs masking to identify the lowest-energy folds in two-dimensional Hydrophobic-Polar (HP) lattice protein folding. By leveraging Dilated Recurrent Neural Networks (RNNs) integrated with an annealing process driven by temperature-like fluctuations, our method accurately predicts optimal folds for benchmark systems of up to 60 beads. Our approach also excludes invalid folds from sampling without compromising the autoregressive sampling properties of RNNs. This scheme is generalizable to three spatial dimensions and can be extended to lattice protein models with larger alphabets. Our findings emphasize the potential of advanced machine learning techniques in tackling complex protein folding problems and a broader class of constrained combinatorial optimization challenges.
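For concreteness, the objective being minimized is the standard 2D HP energy: each pair of hydrophobic (H) beads that are lattice neighbours but not consecutive along the chain contributes −1, and the lowest-energy fold maximizes such H-H contacts. A minimal sketch (function and argument names are illustrative, not from the paper):

```python
def hp_energy(sequence, path):
    """Standard 2D HP model energy.

    sequence: string over {'H', 'P'}, one letter per bead.
    path: list of (x, y) lattice positions, one per bead, in chain order.
    Each non-consecutive H-H pair on adjacent lattice sites contributes -1.
    """
    assert len(sequence) == len(path)
    energy = 0
    for i in range(len(path)):
        for j in range(i + 2, len(path)):  # j >= i+2 skips chain neighbours
            if sequence[i] == "H" == sequence[j]:
                (xi, yi), (xj, yj) = path[i], path[j]
                if abs(xi - xj) + abs(yi - yj) == 1:  # lattice-adjacent
                    energy -= 1
    return energy
```

For example, folding `HHHH` into a unit square `[(0,0), (1,0), (1,1), (0,1)]` gives one non-consecutive H-H contact (beads 1 and 4), so the energy is −1, while the fully extended chain scores 0.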