Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences

📅 2025-12-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large-scale analysis of SARS-CoV-2 spike protein sequences is held back by computational inefficiency, poor scalability, and reliance on multiple sequence alignment or deep learning models. To address this, the paper proposes an alignment-free, low-overhead, hashing-based embedding method. It introduces the first application of MurmurHash3 to direct k-mer spectrum hashing, followed by PCA dimensionality reduction and a lightweight classifier (XGBoost, SVM, or MLP). Embedding runs in O(n) time in the sequence length and completes in milliseconds per sequence, reducing embedding generation time by up to 99.81% relative to the compared baseline and state-of-the-art methods. On a million-sequence benchmark, it attains up to 86.4% lineage classification accuracy, a favorable trade-off among computational efficiency, scalability, and discriminative power.
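The hashing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the default `k` and `dim`, the modulo bucketing, and the normalization by k-mer count are all assumptions made for illustration, and the PCA and classifier stages are omitted. MurmurHash3 (x86, 32-bit variant) is written out in pure Python so the example has no dependencies.

```python
from typing import List

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit; returns an unsigned 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    nblocks = n // 4

    # Body: mix each full 4-byte little-endian block into the state.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix the remaining 0-3 bytes.
    tail = data[4 * nblocks:]
    k = 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def kmer_hash_embedding(seq: str, k: int = 3, dim: int = 64) -> List[float]:
    """Hash each overlapping k-mer into one of `dim` buckets (assumed scheme).

    Counts are divided by the number of k-mers so that sequences of
    different lengths stay comparable (an illustrative choice).
    """
    vec = [0.0] * dim
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return vec  # sequence shorter than k: no k-mers to hash
    for i in range(n_kmers):
        bucket = murmur3_32(seq[i:i + k].encode("ascii")) % dim
        vec[bucket] += 1.0
    return [v / n_kmers for v in vec]


# Toy amino-acid string, used only to exercise the functions.
example = kmer_hash_embedding("MFVFLVLLPLVSSQCVNLT", k=3, dim=16)
```

A sequence of length n yields n − k + 1 overlapping k-mers and each is hashed once, so the loop does a linear number of hash calls, which is consistent with the O(n) claim for fixed k. The output length depends only on `dim`, not on the sequence length or on any alignment.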

📝 Abstract

Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4% classification accuracy while reducing embedding generation time by as much as 99.81%. This highlights the method's potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.
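The reported reduction in embedding time can be restated as a speedup factor. The conversion below assumes that "reducing embedding generation time by as much as 99.81%" means the proposed method takes 0.19% of the time of the slowest compared method. The symbols $t_{\text{baseline}}$ and $t_{\text{new}}$ are introduced here for illustration and do not appear in the abstract.

```latex
\[
\text{speedup}
  = \frac{t_{\text{baseline}}}{t_{\text{new}}}
  = \frac{1}{1 - 0.9981}
  = \frac{1}{0.0019}
  \approx 526
\]
```

Under that reading, the best-case figure corresponds to roughly a 526-fold speedup over the slowest baseline, not over every compared method.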

Problem

Research questions and friction points this paper is trying to address.

Develops a scalable embedding method for COVID-19 spike sequences

Addresses computational inefficiency in existing viral sequence analysis

Enables fast, accurate lineage classification using machine learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses hashing to generate compact spike sequence embeddings

Trains machine learning models for lineage classification

Achieves high accuracy with drastically reduced generation time
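The classification stage in the points above can be sketched in a few lines. This example swaps in a nearest-centroid classifier as a dependency-free stand-in; it is not one of the models named in the summary (XGBoost, SVM, MLP). The function names, the labels `lineage_A` and `lineage_B`, and the toy vectors are hypothetical and are not data or results from the paper.

```python
import math
from typing import Dict, List, Sequence


def fit_centroids(X: Sequence[Sequence[float]],
                  y: Sequence[str]) -> Dict[str, List[float]]:
    """Average the embedding vectors of each label into one centroid."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, v in enumerate(vec):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def predict(centroids: Dict[str, List[float]],
            vec: Sequence[float]) -> str:
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


def accuracy(centroids: Dict[str, List[float]],
             X: Sequence[Sequence[float]],
             y: Sequence[str]) -> float:
    """Fraction of vectors whose predicted label matches the true label."""
    hits = sum(predict(centroids, vec) == label for vec, label in zip(X, y))
    return hits / len(y)


# Toy 3-dimensional "embeddings" with hypothetical lineage labels.
X_train = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0],
           [0.1, 0.1, 0.8], [0.0, 0.2, 0.8]]
y_train = ["lineage_A", "lineage_A", "lineage_B", "lineage_B"]

centroids = fit_centroids(X_train, y_train)
predicted = predict(centroids, [0.7, 0.3, 0.0])
```

Any classifier that accepts fixed-length vectors could take the place of `fit_centroids` and `predict`, because hashed embeddings have the same dimensionality for every sequence.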
