AI Summary
This work proposes AMix-2, a large language model that natively integrates protein sequences as a primary modality, unifying protein understanding and conditional generation in one model. The method places proteins and text in a shared token space and replaces conventional autoregressive generation with block-level diffusion language modeling. Generation is causal across blocks, while each block uses bidirectional context and iterative refinement, which suits proteins better than strict left-to-right decoding. On the newly introduced ProteinArena benchmark, AMix-2 performs competitively with specialized protein models across multiple tasks and outperforms existing general-purpose large language models.
Abstract
We present AMix-2, a protein-text foundation model that establishes protein as a native modality in large language models (LLMs), unifying protein understanding and sequence design within a single foundation model. AMix-2 is built upon two key ideas: (1) a unified protein-text formulation that embeds natural language and protein sequences in a shared token space, enabling one model to perform biological reasoning and conditional design instead of relying on separate task-specialized downstream models; and (2) a block-wise diffusion language modeling backbone that combines causal generation across blocks with bidirectional context and iterative refinement within blocks. This scheme better matches the intrinsic nature of proteins than a strict left-to-right factorization. To evaluate protein foundation models under realistic generalization settings, we further introduce ProteinArena, a comprehensive benchmark with time-aware and homology-aware protocols across various understanding and design tasks, and with baselines covering classical bioinformatics tools, protein-specialized models, and LLMs. On ProteinArena, AMix-2 outperforms frontier LLMs and performs competitively with task-specific protein models. Controlled experiments further show that the diffusion-based paradigm generally surpasses its autoregressive counterpart, highlighting the advantage of flexible generation order for protein sequences. We release both AMix-2 and ProteinArena to facilitate open research in protein foundation models.
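The attention pattern implied by "causal generation across blocks with bidirectional context within blocks" can be illustrated with a small mask. This is a minimal sketch of the general block-causal idea, not the AMix-2 implementation: the function name `block_causal_mask`, the fixed block size, and the boolean-list representation are all assumptions made for illustration, and the abstract does not specify how the model realizes this pattern.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask where mask[i][j] is True if position i may attend to j.

    Hypothetical helper for illustration only (not from the paper).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Index of the block that each position belongs to.
    block_of = [pos // block_size for pos in range(seq_len)]
    # Position i sees its whole block (bidirectional within the block)
    # and all earlier blocks (causal across blocks), never later blocks.
    return [
        [block_of[j] <= block_of[i] for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Assumed toy setting: 6 tokens split into blocks of 2.
mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With these toy settings, the first two rows allow attention only to positions 0 and 1, the next two rows to positions 0 through 3, and the last two rows to all six positions. A strict left-to-right model would instead use a lower-triangular mask, so position 0 could not see position 1 even though both sit in the same block.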