Benchmarking Preprocessing and Integration Methods in Single-Cell Genomics

📅 2026-01-01
🤖 AI Summary
This study addresses the lack of systematic evaluation of cross-modal integration methods for single-cell multi-omics data, where performance depends heavily on dataset size and characteristics. It benchmarks combinations of seven normalization strategies, five integration methods (including Seurat and Harmony), and four dimensionality reduction techniques (including UMAP), covering the pipeline from preprocessing to embedding, on six datasets spanning diverse modalities, tissues, and organisms. Using three standardized metrics (Silhouette coefficient, Adjusted Rand Index (ARI), and Calinski–Harabasz index), the work finds that Seurat and Harmony perform best at integration, with Harmony the more time-efficient of the two, especially on large datasets, and that UMAP is the dimensionality reduction method most compatible with the integration approaches. The best normalization strategy varies with the integration method, so the two should be selected jointly.
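To give a sense of the size of the search space, the sketch below enumerates a full factorial grid over the three pipeline stages. It is only an illustration: the summary names Seurat, Harmony, and UMAP, so every other method name here is a hypothetical placeholder, and the source does not state whether every combination was actually run.

```python
from itertools import product

# Only Seurat, Harmony, and UMAP are named in the source; the remaining
# entries are hypothetical placeholders standing in for unnamed methods.
normalizations = [f"norm_{i}" for i in range(1, 8)]                    # 7 strategies
integrations = ["Seurat", "Harmony"] + [f"integ_{i}" for i in range(3, 6)]  # 5 methods
reductions = ["UMAP"] + [f"dimred_{i}" for i in range(2, 5)]           # 4 techniques

# Each pipeline is one (normalization, integration, reduction) triple.
pipelines = list(product(normalizations, integrations, reductions))

print(len(pipelines))  # 7 * 5 * 4 = 140
```

Under that full-grid assumption there would be 140 candidate pipelines per dataset, which is why pairing normalization with integration matters more than tuning either stage alone.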

📝 Abstract
Single-cell data analysis has the potential to revolutionize personalized medicine by characterizing disease-associated molecular changes at the single-cell level. Advanced single-cell multimodal assays can now simultaneously measure various molecules (e.g., DNA, RNA, protein) across hundreds of thousands of individual cells, providing a comprehensive molecular readout. A significant analytical challenge is integrating single-cell measurements across different modalities. Various methods have been developed to address this challenge, but there has been no systematic evaluation of these techniques with different preprocessing strategies. This study examines a general pipeline for single-cell data analysis, which includes normalization, data integration, and dimensionality reduction. The performance of different algorithm combinations often depends on dataset size and characteristics. We evaluate six datasets across diverse modalities, tissues, and organisms using three metrics: Silhouette Coefficient Score, Adjusted Rand Index, and Calinski-Harabasz Index. Our experiments involve combinations of seven normalization methods, four dimensionality reduction methods, and five integration methods. The results show that Seurat and Harmony excel in data integration, with Harmony being more time-efficient, especially for large datasets. UMAP is the dimensionality reduction method most compatible with the integration techniques, and the best choice of normalization method depends on the integration method used.
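The three metrics named in the abstract can be sketched in plain Python as follows. This is a minimal illustration using textbook definitions with Euclidean distance, not the authors' evaluation code; the toy embedding and labels are made up, and at least two clusters are assumed. In practice, scikit-learn's `silhouette_score`, `adjusted_rand_score`, and `calinski_harabasz_score` cover the same ground.

```python
from collections import Counter
from math import comb, dist


def group_by_label(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two labelings (1.0 = identical)."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(dist(point, q) for q in members) / len(members)
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def calinski_harabasz(points, labels):
    """Between- to within-cluster dispersion ratio (higher = better separated)."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(m) * dist(centroid(m), overall) ** 2 for m in clusters.values())
    within = sum(dist(p, centroid(m)) ** 2 for m in clusters.values() for p in m)
    if within == 0:  # every point sits on its centroid; ratio is undefined
        return 1.0
    return (between / (k - 1)) / (within / (n - k))


# Toy 2-D "embedding": two well-separated groups of three cells each.
embedding = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cell_types = [0, 0, 0, 1, 1, 1]   # reference annotation
clustering = [0, 0, 1, 1, 1, 1]   # a clustering with one misassigned cell

print("ARI:", adjusted_rand_index(cell_types, clustering))
print("Silhouette:", silhouette(embedding, cell_types))
print("Calinski-Harabasz:", calinski_harabasz(embedding, cell_types))
```

Note that ARI compares a clustering against reference labels, while the Silhouette coefficient and Calinski-Harabasz index score cluster separation in the embedding itself, so the three metrics are complementary.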
Problem

Research questions and friction points this paper is trying to address.

single-cell genomics
data integration
multimodal data
preprocessing
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

single-cell multimodal integration
benchmarking
data preprocessing
Harmony
Seurat
Ali Anaissi
University of Technology Sydney, Australia
Seid Miad Zandavi
University of Sydney, Australia
Weidong Huang
Beijing Institute for General Artificial Intelligence
Humanoid, World Models, Reinforcement Learning
Junaid Akram
University of Sydney, Australia
Basem Suleiman
University of New South Wales, Australia
Ali Braytee
University of Technology Sydney
machine learning, optimization, data mining, computational biology
Jie Hua
School of Computing, Macquarie University
Data Visualisation, Data Science, Decision Making