Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences

📅 2025-12-10
🤖 AI Summary
To address the computational inefficiency, poor scalability, and reliance on multiple sequence alignment or deep learning models in large-scale SARS-CoV-2 spike protein sequence analysis, this paper proposes an alignment-free, low-overhead hashing-based embedding method. Specifically, it introduces the first application of MurmurHash3 for direct k-mer spectrum hashing, followed by PCA dimensionality reduction and lightweight classification (XGBoost, SVM, or MLP). The method achieves O(n) linear time complexity and embeds each sequence in milliseconds, cutting embedding generation time by up to 99.81% relative to state-of-the-art approaches. On a million-sequence benchmark, it attains up to 86.4% lineage classification accuracy, a favorable trade-off among computational efficiency, scalability, and discriminative power.
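
The hashing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary does not give the k-mer length, the embedding dimension, or how hash values are mapped to vector positions, so `k=3`, `dim=64`, the modulo bucketing, and the function names `murmur3_32` and `kmer_hash_embedding` are all assumptions made here. The hash itself is a pure-Python version of the standard 32-bit MurmurHash3 (x86 variant), written out so the example needs no third-party package.

```python
from typing import List

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 (x86, 32-bit); returns an unsigned 32-bit int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    body_end = n - (n % 4)

    # Mix the input four bytes at a time (little-endian blocks).
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Mix the 1 to 3 leftover bytes, if any.
    tail = data[body_end:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length, then avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def kmer_hash_embedding(seq: str, k: int = 3, dim: int = 64) -> List[int]:
    """Count overlapping k-mers into `dim` buckets chosen by MurmurHash3.

    k, dim, and the modulo bucketing are illustrative assumptions,
    not values taken from the paper.
    """
    vec = [0] * dim
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        vec[murmur3_32(kmer.encode("ascii")) % dim] += 1
    return vec


if __name__ == "__main__":
    # Short illustrative amino-acid string, not a full spike sequence.
    fragment = "MFVFLVLLPLVSSQCVNLT"
    emb = kmer_hash_embedding(fragment)
    print(len(emb), sum(emb))
```

With a fixed `k`, the loop makes one pass over the sequence and hashes a constant-size window at each position, which matches the linear-time behavior the summary claims. No alignment is needed because only k-mer counts enter the vector. In the pipeline the summary describes, vectors like these would then be reduced with PCA and passed to a classifier; those steps are omitted here.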

📝 Abstract
Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4% classification accuracy while reducing embedding generation time by as much as 99.81%. This highlights the method's potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.
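
To give a sense of scale for the 99.81% figure, here is a short worked calculation. It assumes the reduction is measured against the embedding time of a single baseline method, which the abstract does not name; the symbols t_base and t_new are introduced here for illustration only.

```latex
\[
t_{\text{new}} = (1 - 0.9981)\, t_{\text{base}} = 0.0019\, t_{\text{base}},
\qquad
\frac{t_{\text{base}}}{t_{\text{new}}} = \frac{1}{0.0019} \approx 526
\]
```

Under that assumption, the best case corresponds to a speedup of roughly 526 times over that baseline. Because the abstract states the figure as "as much as", speedups against other baselines would be smaller.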
Problem

Research questions and friction points this paper is trying to address.

Develops scalable embedding method for COVID-19 spike sequences
Addresses computational inefficiency in existing viral sequence analysis
Enables fast, accurate lineage classification using machine learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses hashing to generate compact spike sequence embeddings
Trains machine learning models for lineage classification
Achieves high accuracy with drastically reduced generation time