BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models

📅 2025-06-17

🤖 AI Summary
Current transcriptomic foundation models (TFMs) are hard to reproduce and compare: their training objectives and architectures are highly fragmented, and the field has not converged on best practices. To address this, we present BMFM-RNA, a unified, open-source, modular TFM framework, and introduce the Whole-Cell Expression Decoder (WCED), a self-supervised pretraining objective that uses the [CLS] token as a bottleneck to model global gene expression patterns across the entire cell. WCED can be trained on its own or jointly with masked language modeling (MLM) in a multitask setting. The framework supports diverse input representations, including log-normalized counts and BERT-style tokenization, and the evaluated checkpoints are pretrained on CELLxGENE data. Across more than a dozen single-cell datasets, WCED-based models match or exceed state-of-the-art models such as scGPT in both zero-shot and fine-tuning settings, on tasks such as cell-type annotation, batch correction, and perturbation prediction.
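The summary names log-normalized counts as one supported input representation. As a rough illustration only, the sketch below shows the common single-cell convention of scaling each cell's counts to a fixed total and applying log1p. The function name `log_normalize` and the `target_sum` of 10,000 are assumptions chosen for illustration; the source does not say which normalization constants BMFM-RNA uses.

```python
import math


def log_normalize(counts, target_sum=10_000.0):
    """Scale one cell's raw counts to `target_sum`, then apply log(1 + x).

    `target_sum` is an assumed, commonly used value, not one taken from the paper.
    """
    total = sum(counts)
    if total == 0:
        # An empty cell has no expression to rescale.
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]


cell = [0, 3, 1, 6]  # toy raw counts for four genes
normalized = log_normalize(cell)
```

Zero counts stay at zero and the ordering of genes by expression is preserved, which is why this representation is a convenient continuous input for a model.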

📝 Abstract
Transcriptomic foundation models (TFMs) have recently emerged as powerful tools for analyzing gene expression in cells and tissues, supporting key tasks such as cell-type annotation, batch correction, and perturbation prediction. However, the diversity of model implementations and training strategies across recent TFMs, though promising, makes it challenging to isolate the contribution of individual design choices or evaluate their potential synergies. This hinders the field's ability to converge on best practices and limits the reproducibility of insights across studies. We present BMFM-RNA, an open-source, modular software package that unifies diverse TFM pretraining and fine-tuning objectives within a single framework. Leveraging this capability, we introduce a novel training objective, whole cell expression decoder (WCED), which captures global expression patterns using an autoencoder-like CLS bottleneck representation. In this paper, we describe the framework, supported input representations, and training objectives. We evaluated four model checkpoints pretrained on CELLxGENE using combinations of masked language modeling (MLM), WCED and multitask learning. Using the benchmarking capabilities of BMFM-RNA, we show that WCED-based models achieve performance that matches or exceeds state-of-the-art approaches like scGPT across more than a dozen datasets in both zero-shot and fine-tuning tasks. BMFM-RNA, available as part of the biomed-multi-omics project (https://github.com/BiomedSciAI/biomed-multi-omic), offers a reproducible foundation for systematic benchmarking and community-driven exploration of optimal TFM training strategies, enabling the development of more effective tools to leverage the latest advances in AI for understanding cell biology.
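The abstract describes WCED as an autoencoder-like objective in which a CLS bottleneck representation is used to capture the whole cell's expression. The toy sketch below illustrates that idea under strong assumptions: a single linear decoder maps the CLS vector to one predicted value per gene, and the loss is mean squared error against the observed profile. The decoder shape, the loss, and all names (`linear`, `wced_loss`, `hidden_dim`, `n_genes`) are hypothetical and are not taken from BMFM-RNA.

```python
import random


def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def wced_loss(cls_embedding, weights, bias, target_expression):
    """Assumed WCED-style loss: MSE between the profile decoded from the
    CLS vector and the observed expression of every gene in the cell."""
    predicted = linear(weights, bias, cls_embedding)
    errors = [(p - t) ** 2 for p, t in zip(predicted, target_expression)]
    return sum(errors) / len(errors)


rng = random.Random(0)
hidden_dim, n_genes = 4, 6  # toy sizes, chosen for illustration

# Stand-in for the encoder's CLS output; a real model would compute this.
cls_embedding = [rng.uniform(-1, 1) for _ in range(hidden_dim)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(n_genes)]
bias = [0.0] * n_genes
target_expression = [0.0, 1.2, 0.0, 3.4, 0.7, 0.0]  # toy whole-cell profile

loss = wced_loss(cls_embedding, weights, bias, target_expression)
```

The point of the bottleneck is that every gene's prediction must pass through the single CLS vector, so minimizing this loss pushes that vector to summarize the cell as a whole.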
Problem

Research questions and friction points this paper is trying to address.

Diverse TFM implementations and training strategies make it hard to isolate the contribution of individual design choices
Limited reproducibility keeps insights from carrying over across transcriptomic studies
The field needs a unified framework for systematic TFM benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

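Modular, open-source framework that unifies TFM pretraining and fine-tuning objectives
Introduces WCED, a training objective that captures global expression patterns through a CLS bottleneck
Pretrains checkpoints with combinations of MLM, WCED, and multitask learning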
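To make the idea of combining objectives concrete, here is a minimal sketch of one way a masked-token loss and a whole-cell loss could be summed into a single training signal. It assumes expression values are tokenized into discrete classes for MLM and that the two losses are combined by a weighted sum; the weights `mlm_weight` and `wced_weight`, the function names, and the toy numbers are all hypothetical rather than the paper's configuration.

```python
import math


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked positions only (unmasked ones are skipped)."""
    total, count = 0.0, 0
    for position_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        # Numerically stable log-sum-exp over the vocabulary.
        max_logit = max(position_logits)
        log_norm = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in position_logits)
        )
        total += log_norm - position_logits[target]
        count += 1
    return total / count if count else 0.0


def multitask_loss(mlm_loss, wced_loss, mlm_weight=1.0, wced_weight=1.0):
    """Assumed combination rule: a weighted sum of the two objectives."""
    return mlm_weight * mlm_loss + wced_weight * wced_loss


# Three token positions, a toy vocabulary of three expression bins.
logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0], [0.1, 3.0, 0.2]]
targets = [0, 2, 1]
mask = [True, False, True]  # only positions 0 and 2 were masked

mlm = masked_cross_entropy(logits, targets, mask)
total = multitask_loss(mlm, wced_loss=0.25)  # 0.25 is a placeholder WCED loss
```

Keeping the two losses as separate terms makes it straightforward to switch either objective off (weight 0) and compare MLM-only, WCED-only, and multitask checkpoints under otherwise identical settings.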
Authors

Bharath Dandala (IBM): Natural Language Processing, Machine Learning, Deep Learning, Clinical NLP
Michael M Danziger (IBM Research): foundation models, causal inference, statistical physics, network science, infrastructure resilience
Ella Barkan (IBM Research): Medical Imaging, Document Processing
Tanwi Biswas (IBM Research)
V. Gurev (IBM Research)
Jianying Hu (IBM Research)
Matthew Madgwick (IBM Research)
Akira Koseki (IBM Research)
T. Kozlovski (IBM Research)
Michal Rosen-Zvi (Director, IBM Research): Machine Learning, AI, Health Informatics
Y. Shimoni (IBM Research)
Ching-Huei Tsou (IBM Research)