Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification

2025-02-11
Citations: 0
Influential: 0
AI Summary
This work addresses the limitation of existing biomolecular pre-trained models, which neglect cross-omics interactions among DNA, RNA, and proteins. We propose the first unified multi-omics modeling framework grounded in the Central Dogma. Methodologically: (1) we introduce a novel nucleotide representation paradigm driven by reverse transcription and reverse translation; (2) we design a codon-aware tokenizer and a hybrid long-sequence Transformer architecture; and (3) we integrate masked modeling pre-training with knowledge distillation from protein language models to enable end-to-end modeling, from coding sequences to tertiary protein structures. Our framework achieves state-of-the-art performance across 12 downstream tasks spanning genomics, transcriptomics, and proteomics. It significantly improves accuracy in cross-omics functional prediction and enhances biological interpretability through mechanistically grounded representations.

๐Ÿ“ Abstract
The interactions between DNA, RNA, and proteins are fundamental to biological processes, as illustrated by the central dogma of molecular biology. While modern biological pre-trained models have achieved great success in analyzing these macromolecules individually, their interconnected nature remains under-explored. In this paper, we follow the guidance of the central dogma to redesign both the data and model pipeline and offer a comprehensive framework, Life-Code, that spans different biological functions. For the data flow, we propose a unified pipeline that integrates multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences. For the model, we design a codon tokenizer and a hybrid long-sequence architecture that encode the interactions of both coding and non-coding regions with masked modeling pre-training. To model the translation and folding process with coding sequences, Life-Code learns protein structures of the corresponding amino acids by knowledge distillation from off-the-shelf protein language models. Such designs enable Life-Code to capture complex interactions within genetic sequences, providing a more comprehensive understanding of multi-omics with the central dogma. Extensive experiments show that Life-Code achieves state-of-the-art performance on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.
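
The unification step described in the abstract (mapping RNA and protein inputs into nucleotide space) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the `PREFERRED_CODON` table, and the rule of picking one fixed codon per amino acid are all assumptions made for illustration. Reverse transcription is also simplified here to a U-to-T substitution in the same orientation, and the paper may resolve codon degeneracy differently.

```python
# Hypothetical one-codon-per-residue table. The genetic code is degenerate,
# so choosing a single codon per amino acid is an illustrative assumption.
PREFERRED_CODON = {
    "A": "GCC", "R": "CGG", "N": "AAC", "D": "GAC", "C": "TGC",
    "Q": "CAG", "E": "GAG", "G": "GGC", "H": "CAC", "I": "ATC",
    "L": "CTG", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCC",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTG",
    "*": "TGA",  # stop
}


def reverse_transcribe(rna: str) -> str:
    """Map an RNA string to the DNA alphabet (simplified: U -> T only)."""
    return rna.upper().replace("U", "T")


def reverse_translate(protein: str) -> str:
    """Map an amino-acid string to a DNA string using the fixed table above."""
    try:
        return "".join(PREFERRED_CODON[aa] for aa in protein.upper())
    except KeyError as exc:
        raise ValueError(f"unsupported residue: {exc.args[0]!r}") from None


def to_nucleotides(seq: str, kind: str) -> str:
    """Bring a DNA, RNA, or protein sequence into one nucleotide alphabet."""
    if kind == "dna":
        return seq.upper()
    if kind == "rna":
        return reverse_transcribe(seq)
    if kind == "protein":
        return reverse_translate(seq)
    raise ValueError(f"unknown sequence kind: {kind!r}")


if __name__ == "__main__":
    print(to_nucleotides("AUGGCC", "rna"))
    print(to_nucleotides("MA*", "protein"))
```

With every modality expressed over the same A/C/G/T alphabet, a single tokenizer and a single model can consume all three input types, which is the property the abstract attributes to the unified pipeline.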
Problem

Research questions and friction points this paper is trying to address.

Unify multi-omics data integration
Model DNA, RNA, protein interactions
Advance multi-omics analysis and interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multi-omics data pipeline
Codon tokenizer and hybrid architecture
Knowledge distillation for protein structures
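
As a rough illustration of the codon tokenizer idea listed above, the following Python sketch splits a nucleotide sequence into in-frame, non-overlapping 3-mers and maps them to integer ids. The vocabulary layout, the special tokens, and the `frame` parameter are hypothetical; the source does not describe the actual tokenizer at this level of detail.

```python
from itertools import product

BASES = "ACGT"
# Hypothetical special tokens; the real vocabulary is not given in the source.
SPECIALS = ["[PAD]", "[MASK]", "[UNK]"]
CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # 64 codons
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + CODONS)}


def codon_tokenize(seq: str, frame: int = 0) -> list[str]:
    """Split a nucleotide string into non-overlapping 3-mers from `frame`.

    Trailing bases that do not fill a complete codon are dropped.
    """
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def encode(seq: str, frame: int = 0) -> list[int]:
    """Convert a nucleotide string to token ids; unknown 3-mers map to [UNK]."""
    unk = VOCAB["[UNK]"]
    return [VOCAB.get(codon, unk) for codon in codon_tokenize(seq, frame)]


if __name__ == "__main__":
    print(codon_tokenize("ATGGCCTGA"))
    print(encode("ATGGCCTGA"))
```

Tokenizing in-frame keeps each token aligned with one amino acid in coding regions, which is what makes a per-token link to protein-level supervision (such as distillation from a protein language model) straightforward.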
Zicheng Liu
AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China; Zhejiang University, Hangzhou, China
Siyuan Li
AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China; Zhejiang University, Hangzhou, China
Zhiyuan Chen
AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China; University of Hong Kong, Hong Kong, China
Lei Xin
The Chinese University of Hong Kong
Machine Learning, System Identification, Optimization
Fang Wu
AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China; Stanford University, CA, USA
Chang Yu
AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China
Qirong Yang
BioMap Research, Beijing, China
Yucheng Guo
Princeton University
Stochastic Analysis, Partial Differential Equations, Mathematical Finance
Yujie Yang
BioMap Research, Beijing, China
Stan Z. Li
AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China