OmniCellTOSG: The First Cell Text-Omic Signaling Graphs Dataset for Joint LLM and GNN Modeling

📅 2025-04-02
🤖 AI Summary
This study addresses the difficulty of modeling cell signaling systems, which vary with factors such as age, sex, and disease status and involve tens of thousands of genes and proteins across diverse cell subtypes. It introduces the Text-Omic Signaling Graph (TOSG) and presents OmniCellTOSG as the first dataset of such graphs. Built from single-cell RNA-seq profiles of approximately 120 million cells, the dataset pairs quantitative omic signals with human-readable biological annotations (e.g., pathways, drugs, diseases). The graph design is intended for joint modeling with text embeddings, graph neural networks (GNNs), and large language models (LLMs), and the data are compatible with PyTorch. The open-source dataset is designed to be extended over time, and its labels (organ, disease, sex, age, cell subtype) support cross-organ, cross-disease, and cross-age analyses relevant to precision medicine.

📝 Abstract
Complex cell signaling systems -- governed by varying protein abundances and interactions -- generate diverse cell types across organs. These systems evolve under influences such as age, sex, diet, environmental exposures, and diseases, making them challenging to decode given the involvement of tens of thousands of genes and proteins. Recently, hundreds of millions of single-cell omics data have provided a robust foundation for understanding these signaling networks within various cell subpopulations and conditions. Inspired by the success of large foundation models (for example, large language models and large vision models) pre-trained on massive datasets, we introduce OmniCellTOSG, the first dataset of cell text-omic signaling graphs (TOSGs). Each TOSG represents the signaling network of an individual or meta-cell and is labeled with information such as organ, disease, sex, age, and cell subtype. OmniCellTOSG offers two key contributions. First, it introduces a novel graph model that integrates human-readable annotations -- such as biological functions, cellular locations, signaling pathways, related diseases, and drugs -- with quantitative gene and protein abundance data, enabling graph reasoning to decode cell signaling. This approach calls for new joint models combining large language models and graph neural networks. Second, the dataset is built from single-cell RNA sequencing data of approximately 120 million cells from diverse tissues and conditions (healthy and diseased) and is fully compatible with PyTorch. This facilitates the development of innovative cell signaling models that could transform research in life sciences, healthcare, and precision medicine. The OmniCellTOSG dataset is continuously expanding and will be updated regularly. The dataset and code are available at https://github.com/FuhaiLiAiLab/OmniCellTOSG.
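
To make the graph description above concrete, here is a minimal sketch of how one text-omic signaling graph could be held in memory. It is an illustration under stated assumptions, not the dataset's actual schema or loader: the class names `Node` and `TextOmicGraph`, the field names, the simplified three-node pathway, and all numeric values are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One gene or protein entity (hypothetical schema, not the dataset's)."""
    name: str
    abundance: float   # quantitative omic signal, e.g. normalized expression
    annotation: str    # human-readable text, e.g. function or pathway


@dataclass
class TextOmicGraph:
    """Toy container for the signaling graph of one cell or meta-cell."""
    nodes: list[Node]
    edges: list[tuple[int, int]]                      # (source index, target index)
    labels: dict[str, str] = field(default_factory=dict)

    def neighbors(self, idx: int) -> list[int]:
        """Return indices of the nodes that node `idx` signals to."""
        return [dst for src, dst in self.edges if src == idx]


# Illustrative values only; the pathway is simplified and nothing here
# is taken from the dataset.
graph = TextOmicGraph(
    nodes=[
        Node("EGFR", 2.5, "receptor tyrosine kinase at the plasma membrane"),
        Node("KRAS", 1.0, "small GTPase in the MAPK signaling pathway"),
        Node("MAPK1", 0.5, "kinase that relays growth signals to the nucleus"),
    ],
    edges=[(0, 1), (1, 2)],
    labels={"organ": "lung", "disease": "healthy", "sex": "female"},
)
```

Each node carries both a number and a text annotation, and the graph-level `labels` dictionary mirrors the kind of metadata the abstract lists (organ, disease, sex, age, cell subtype). Calling `graph.neighbors(0)` walks one signaling step downstream from the first node.
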
Problem

Research questions and friction points this paper is trying to address.

Decoding complex cell signaling systems that involve tens of thousands of genes and proteins
Integrating text-omic data for joint LLM and GNN modeling
Building scalable datasets for cell signaling research in healthcare
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates human-readable annotations with quantitative gene and protein abundance data
Calls for joint models that combine large language models and graph neural networks
Built from single-cell RNA sequencing data of approximately 120 million cells
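
The idea of combining text-derived features with graph structure can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's model: `embed_text` is a hashed bag-of-words stand-in for a language-model embedding, `message_pass` is a single round of mean aggregation standing in for a GNN layer, and every name and value here is an assumption made for the example.

```python
import hashlib


def embed_text(text, dim=4):
    """Stand-in for a language-model embedding: hashed bag of words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # A cryptographic hash keeps bucket assignment stable across runs.
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


def node_features(abundance, annotation):
    """Concatenate the omic value with the text vector for one node."""
    return [abundance] + embed_text(annotation)


def message_pass(features, edges):
    """One round of mean aggregation over each node and its incoming neighbors."""
    updated = []
    for i, own in enumerate(features):
        incoming = [features[src] for src, dst in edges if dst == i]
        group = [own] + incoming
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated


# Hypothetical two-node graph; values are illustrative only.
features = [
    node_features(2.0, "receptor tyrosine kinase"),
    node_features(1.0, "small GTPase in MAPK signaling"),
]
edges = [(0, 1)]  # node 0 signals to node 1
updated = message_pass(features, edges)
```

After one round, the second node's feature vector is the average of its own features and those of its upstream neighbor, so both the abundance value and the text-derived dimensions propagate along the edge. A real joint model would replace both stand-ins with learned components.
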
Authors

Heming Zhang, Washington University in St. Louis
Tim Xu, Washington University in St. Louis
Dekang Cao, Washington University in St. Louis
Shunning Liang, Washington University in St. Louis
Lars Schimmelpfennig, Washington University in St. Louis
Levi Kaster, Washington University in St. Louis
Di Huang, Washington University in St. Louis
Carlos Cruchaga, Professor, NeuroGenomics and Informatics Center Director (interests: Neurogenomics, quantitative traits genetics, single cell RNA-seq, neurodegeneration, Alzheimer disease)
Guangfu Li, University of Connecticut
Michael Province, Washington University in St. Louis
Yixin Chen, Washington University in St. Louis
Philip R. O. Payne, Washington University in St. Louis
Fuhai Li, Washington University in St. Louis (interests: AI, Agentic AI, systems biology, precision medicine)