Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences
By: Sarwan Ali, Taslim Murad
Potential Business Impact:
Finds COVID-19 virus types faster and better.
Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4% classification accuracy while reducing embedding generation time by as much as 99.81%. This highlights the method's potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.
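The abstract does not spell out how hashing produces the embedding, so the following is only a minimal sketch of one common approach: hashing overlapping k-mers of an unaligned sequence into a fixed number of buckets. It is not the authors' implementation. The function name `kmer_hash_embedding`, the parameters `k` and `dim`, and the toy sequence are all hypothetical, and BLAKE2 from Python's standard library stands in for MurmurHash (suggested by the paper's title) purely to keep the example dependency-free.

```python
import hashlib


def kmer_hash_embedding(sequence: str, k: int = 3, dim: int = 64) -> list[float]:
    """Map a sequence to a fixed-length vector of normalized hashed k-mer counts.

    Assumption: k, dim, and the hash function are illustrative choices,
    not values taken from the paper.
    """
    vec = [0.0] * dim
    n_kmers = len(sequence) - k + 1
    if n_kmers <= 0:
        # Sequence shorter than k: no k-mers, return the zero vector.
        return vec
    for i in range(n_kmers):
        kmer = sequence[i:i + k]
        # Stand-in hash (BLAKE2b, 8-byte digest); a MurmurHash variant
        # could be swapped in here.
        digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
        bucket = int.from_bytes(digest, "little") % dim
        vec[bucket] += 1.0
    # Normalize so sequences of different lengths are comparable.
    return [count / n_kmers for count in vec]


# Toy amino-acid string for illustration only.
toy_sequence = "MFVFLVLLPLVSSQCVNLT"
embedding = kmer_hash_embedding(toy_sequence, k=3, dim=64)
print(len(embedding))
```

Because the output length depends only on `dim`, sequences of any length map to vectors of the same size without alignment, and such vectors could then be passed to any standard classifier.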
Similar Papers
PETRA: Pretrained Evolutionary Transformer for SARS-CoV-2 Mutation Prediction
Machine Learning (CS)
Predicts future virus changes to fight sickness.
Incorporating LLM Embeddings for Variation Across the Human Genome
Applications
Helps find genetic causes of diseases faster.
Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing
Machine Learning (CS)
Finds new virus types in poop samples.