AMix-2: Establishing Protein as a Native Modality in Large Language Models

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes AMix-2, a protein-text foundation model that treats protein sequences as a native modality of a large language model, unifying protein understanding and conditional sequence design in a single model. The method places natural language and protein sequences in a shared token space and replaces strictly left-to-right autoregressive generation with block-wise diffusion language modeling: blocks are generated causally, while tokens within a block use bidirectional context and are iteratively refined. The authors argue that this scheme matches the nature of proteins better than a strict left-to-right factorization. On the newly introduced ProteinArena benchmark, AMix-2 outperforms frontier general-purpose LLMs and is competitive with task-specific protein models.
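To make the idea of a shared token space concrete, here is a minimal sketch of one vocabulary that holds both text tokens and amino-acid tokens. The summary does not describe AMix-2's actual tokenizer, so everything here is an assumption for illustration: the `<protein>` / `</protein>` delimiters, the `<aa:X>` token naming, the whitespace-level text tokens, and the helper names `build_vocab` and `encode` are all hypothetical.

```python
# Hypothetical sketch of a shared protein-text vocabulary (not AMix-2's tokenizer).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_vocab(text_words):
    """Build one id space containing special, amino-acid, and text tokens."""
    vocab = {}
    for tok in ["<unk>", "<protein>", "</protein>"]:  # assumed special tokens
        vocab[tok] = len(vocab)
    for aa in AMINO_ACIDS:
        # Residues get their own tokens so "A" (alanine) never collides
        # with the English word "a".
        vocab[f"<aa:{aa}>"] = len(vocab)
    for word in text_words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(prompt_words, protein, vocab):
    """Encode a text prompt followed by a delimited protein sequence."""
    unk = vocab["<unk>"]
    ids = [vocab.get(w, unk) for w in prompt_words]
    ids.append(vocab["<protein>"])
    ids.extend(vocab.get(f"<aa:{res}>", unk) for res in protein)
    ids.append(vocab["</protein>"])
    return ids


words = ["design", "a", "binder"]
vocab = build_vocab(words)
ids = encode(words, "MKT", vocab)
print(ids)
```

Because text and residues share one id space, a single embedding table and a single output head can serve both modalities, which is the property the summary attributes to the unified formulation.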

📝 Abstract

We present AMix-2, a protein-text foundation model that establishes protein as a native modality in large language models (LLMs), unifying protein understanding and sequence design within a single foundation model. AMix-2 is built upon two key ideas: (1) a unified protein-text formulation that embeds natural language and protein sequences in a shared token space, enabling one model to perform biological reasoning and conditional design instead of relying on separate task-specialized downstream models; and (2) a block-wise diffusion language modeling backbone that combines causal generation across blocks with bidirectional context and iterative refinement within blocks. This scheme better matches the intrinsic nature of proteins than a strict left-to-right factorization. To evaluate protein foundation models under realistic generalization settings, we further introduce ProteinArena, a comprehensive benchmark with time-aware and homology-aware protocols across various understanding and design tasks, and with baselines covering classical bioinformatics tools, protein-specialized models, and LLMs. On ProteinArena, AMix-2 outperforms frontier LLMs and achieves performance competitive with task-specific protein models. Controlled experiments further show that the diffusion-based paradigm generally surpasses its autoregressive counterpart, highlighting the advantage of flexible generation order for protein sequences. We release both AMix-2 and ProteinArena to facilitate open research in protein foundation models.
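The combination of causal generation across blocks and bidirectional context within blocks can be pictured as an attention visibility pattern. The sketch below is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the mask, block size, or training setup, and the function name `block_attention_mask` is hypothetical. It only shows which positions may see which others when every token attends to its own block and to all earlier blocks.

```python
# Hypothetical sketch of a block-causal visibility pattern (assumed, not from the paper).
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask where mask[i][j] is True if position i may attend to position j."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    mask = []
    for i in range(seq_len):
        block_i = i // block_size
        # j is visible when its block is the same as (bidirectional within
        # the block) or earlier than (causal across blocks) the block of i.
        mask.append([(j // block_size) <= block_i for j in range(seq_len)])
    return mask


mask = block_attention_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("1" if visible else "0" for visible in row))
```

With `block_size=1` the pattern reduces to the usual left-to-right causal mask, and with `block_size=seq_len` it becomes fully bidirectional, so the block size interpolates between the autoregressive and the fully bidirectional regimes. The iterative refinement inside each block (the diffusion denoising steps) is not modeled here.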

Problem

Research questions and friction points this paper is trying to address.

protein foundation model

large language models

protein sequence design

multimodal learning

biological reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

protein foundation model

unified protein-text modality

block-wise diffusion language modeling