scDD: Latent Codes Based scRNA-seq Dataset Distillation with Foundation Model Knowledge

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Single-cell RNA sequencing (scRNA-seq) data are high-dimensional and extremely sparse, carry strong batch effects and class imbalance, and are growing rapidly in scale, all of which hinder cross-center knowledge transfer and integration. To address this, we propose scDD, a latent-code-based dataset distillation framework that distills foundation model knowledge and the original scRNA-seq data into a compact latent space, from which a generator produces a synthetic dataset that replaces the original. The generator, SCDG, is a single-step conditional diffusion model: back-propagating through a single step avoids the gradient decay caused by multi-step back-propagation, while conditional control preserves scRNA-seq data characteristics and inter-class discriminability. We also propose a comprehensive benchmark that evaluates distillation performance across different data analysis tasks. Averaged over tasks, scDD achieves a 7.61% absolute and 15.70% relative improvement over previous state-of-the-art methods.

📝 Abstract
Single-cell RNA sequencing (scRNA-seq) technology has profiled hundreds of millions of human cells across organs, diseases, development, and perturbations to date. However, the high-dimensional sparsity, batch-effect noise, category imbalance, and ever-increasing scale of the original sequencing data pose significant challenges for multi-center knowledge transfer, data fusion, and cross-validation between scRNA-seq datasets. To address these barriers, (1) we first propose a latent-code-based scRNA-seq dataset distillation framework named scDD, which transfers and distills foundation model knowledge and original dataset information into a compact latent space and uses a generator to produce a synthetic scRNA-seq dataset that replaces the original dataset. Then, (2) we propose a single-step conditional diffusion generator named SCDG, which performs single-step gradient back-propagation to help scDD optimize distillation quality and avoid the gradient decay caused by multi-step back-propagation. Meanwhile, SCDG preserves the scRNA-seq data characteristics and inter-class discriminability of the synthetic dataset through flexible conditional control and generation quality assurance. Finally, (3) we propose a comprehensive benchmark to evaluate the performance of scRNA-seq dataset distillation across different data analysis tasks. Our method achieves a 7.61% absolute and 15.70% relative improvement over previous state-of-the-art methods, averaged across tasks.
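
The two headline figures measure different things. As a minimal sketch (the scores below are hypothetical and are not taken from the paper), an absolute improvement is a difference in percentage points, while a relative improvement divides that difference by the baseline score:

```python
def absolute_gain(new_score, base_score):
    """Improvement in percentage points (both scores given in percent)."""
    return new_score - base_score


def relative_gain(new_score, base_score):
    """Improvement as a percentage of the baseline score."""
    return 100.0 * (new_score - base_score) / base_score


# Hypothetical scores in percent, chosen only for illustration.
baseline_score = 50.0
distilled_score = 58.0

abs_gain = absolute_gain(distilled_score, baseline_score)  # 8.0 points
rel_gain = relative_gain(distilled_score, baseline_score)  # 16.0 percent
print(abs_gain, rel_gain)
```

If each figure is averaged per task, as the abstract suggests, the two averages need not correspond to any single baseline score.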
Problem

Research questions and friction points this paper is trying to address.

High-dimensional sparsity in scRNA-seq data.
Batch-effect noise, category imbalance, and ever-increasing data scale.
Barriers to multi-center knowledge transfer, data fusion, and cross-validation between datasets.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent codes-based scRNA-seq dataset distillation
Single-step conditional diffusion generator (SCDG)
Comprehensive benchmark for distillation performance evaluation
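
The general idea of latent-code distillation can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration: a frozen linear map stands in for the single-step conditional generator, a random projection stands in for foundation model features, and a per-class mean-feature matching loss stands in for the paper's objective, which this summary does not specify. Only the latent codes are optimized, and the gradient passes through a single generator step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an expression matrix: 300 cells x 50 genes, 3 cell types.
n_cells, n_genes, n_classes = 300, 50, 3
labels = rng.integers(0, n_classes, size=n_cells)
class_shift = rng.normal(size=(n_classes, n_genes))
real = rng.poisson(lam=np.exp(0.3 * class_shift[labels])).astype(float)

# Frozen components (hypothetical, NOT the paper's SCDG or a real foundation model).
latent_dim, feat_dim, per_class = 8, 4, 5
W = rng.normal(size=(latent_dim, n_genes)) / np.sqrt(latent_dim)  # generator weights
class_emb = rng.normal(size=(n_classes, n_genes))                 # class condition
P = rng.normal(size=(n_genes, feat_dim)) / np.sqrt(n_genes)       # "teacher" features


def generate(z, c):
    """Single forward step: latent codes plus class condition -> synthetic cells."""
    return z @ W + class_emb[c]


def features(x):
    """Frozen feature extractor used to compare real and synthetic cells."""
    return x @ P


def loss_and_grad(Z):
    """Per-class mean-feature matching loss and its gradient w.r.t. the codes."""
    total = 0.0
    grad = np.zeros_like(Z)
    for c in range(n_classes):
        target = features(real[labels == c]).mean(axis=0)
        diff = features(generate(Z[c], c)).mean(axis=0) - target
        total += float(diff @ diff)
        # Chain rule through the single generator step, written out by hand.
        grad[c] = (2.0 / per_class) * (diff @ P.T @ W.T)
    return total, grad


# The learnable latent codes are the only parameters being optimized.
Z = rng.normal(size=(n_classes, per_class, latent_dim))

# Step size derived from the spectral norm so plain gradient descent is stable.
lr = 0.5 * per_class / np.linalg.norm(W @ P, 2) ** 2

initial_loss, _ = loss_and_grad(Z)
for _ in range(200):
    _, grad = loss_and_grad(Z)
    Z -= lr * grad
final_loss, _ = loss_and_grad(Z)

# The distilled dataset: a few synthetic cells per class, decoded from the codes.
synthetic = np.concatenate([generate(Z[c], c) for c in range(n_classes)])
synthetic_labels = np.repeat(np.arange(n_classes), per_class)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In this sketch the 300 real cells are replaced by 15 synthetic ones whose class-wise feature means approximate those of the real data. A real diffusion generator is nonlinear, so the gradient would come from automatic differentiation instead of the hand-written expression used here.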
Authors

Zhen Yu
School of Translational Medicine & Faculty of IT, Monash University
Digital Health, Dermatology AI, Aging biomarker
Jianan Han
AI Research Institute, China Mobile Communications Corporation, Beijing, China
Yang Liu
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Qingchao Chen
Assistant Professor, Peking University
Transfer Learning, Medical Data Analysis, Multi-modal Human Sensing, Radar Systems