AI Summary
Current research in protein generative modeling suffers from significant fragmentation in representation schemes, model architectures, and task formulations, lacking a unified evaluation framework. This work presents the first comprehensive framework that systematically integrates sequence-based, geometric, and multimodal representations by unifying SE(3)-equivariant diffusion, flow matching, and hybrid prediction-generation architectures. We introduce a data partitioning strategy that prevents information leakage, a physics-informed validation mechanism for structural plausibility, and a function-oriented evaluation protocol. Furthermore, we establish a systematic taxonomy and benchmark spanning tasks from structure prediction to protein-protein interactions. Our study provides both a methodological foundation and practical guidelines for reliable, function-driven protein engineering, while highlighting critical challenges such as conformational dynamics modeling and biosafety considerations.
Abstract
Generative modeling has become a central paradigm in protein research, extending machine learning beyond structure prediction toward sequence design, backbone generation, inverse folding, and biomolecular interaction modeling. However, the literature remains fragmented across representations, model classes, and task formulations, making it difficult to compare methods or identify appropriate evaluation standards. This survey provides a systematic synthesis of generative AI in protein research, organized around (i) foundational representations spanning sequence, geometric, and multimodal encodings; (ii) generative architectures including $\mathrm{SE}(3)$-equivariant diffusion, flow matching, and hybrid predictor-generator systems; and (iii) task settings from structure prediction and de novo design to protein-ligand and protein-protein interactions. Beyond cataloging methods, we compare assumptions, conditioning mechanisms, and controllability, and we synthesize evaluation best practices that emphasize leakage-aware splits, physical validity checks, and function-oriented benchmarks. We conclude with critical open challenges: modeling conformational dynamics and intrinsically disordered regions, scaling to large assemblies while maintaining efficiency, and developing robust safety frameworks for dual-use biosecurity risks. By unifying architectural advances with practical evaluation standards and responsible development considerations, this survey aims to accelerate the transition from predictive modeling to reliable, function-driven protein engineering.