MalTree: Tracing Malware Evolution from Embeddings at Scale

📅 2026-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses two limitations of current practice: malware detection is largely reactive, and the reverse engineering traditionally used to uncover evolutionary relationships among malware families is slow and manual. The authors propose MalTree, which applies phylogenetic methods from bioinformatics (UPGMA and Neighbor-Joining) to malware embeddings derived from structural, behavioral, and image-based features, enabling automated and scalable lineage inference. The resulting large-scale phylogenetic trees are validated against VirusTotal timestamps, reaching 87% temporal consistency, and show that some families mutate over 10 times faster than others. Case studies, including Mirai, show that the inferred phylogenies align with documented threat intelligence.
📝 Abstract
Malware detection remains largely reactive: machine learning models trained on known samples degrade as threats evolve. Understanding evolutionary relationships among malware families can inform proactive defense, but traditional reverse engineering can take months to years to uncover such lineage relationships. We propose MalTree, a framework that applies bioinformatics-inspired phylogenetic techniques (UPGMA and Neighbor-Joining) at scale to model malware evolution automatically using structural, behavioral, and image-based features. We introduce temporal validation using VirusTotal timestamps to assess whether inferred trees reflect actual evolutionary order. MalTree achieves 87% temporal consistency, indicating that inferred evolutionary relationships closely align with real-world emergence timelines. Our analysis shows that some families mutate over 10 times faster than others, suggesting that detection strategies should be tailored to family-specific evolutionary tempos. Case studies, including the Mirai botnet, confirm that inferred relationships from our phylogenetic tree align with documented threat intelligence. Our framework provides a foundation for shifting malware analysis from sample-by-sample classification toward lineage-aware evolutionary modeling.
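To make the pipeline described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It builds a UPGMA (average-linkage) tree over toy embedding vectors and then scores how well the tree agrees with first-seen timestamps. The abstract does not define the temporal consistency metric, so the pairwise-agreement score used here is an assumption, as are the function names `upgma` and `temporal_consistency`, the Euclidean distance, the 2-D toy embeddings, and the integer timestamps standing in for VirusTotal first-seen dates.

```python
from itertools import combinations
from math import dist


def upgma(points):
    """Average-linkage (UPGMA) clustering over embedding vectors.

    Returns (merges, coph). `merges` lists (cluster_a, cluster_b, distance)
    in merge order, with clusters as frozensets of leaf indices. `coph` maps
    frozenset({i, j}) to the average inter-cluster distance at which leaves
    i and j were first joined.
    """
    n = len(points)
    # Pairwise leaf distances (Euclidean here; an assumption for illustration).
    d = {frozenset((i, j)): dist(points[i], points[j])
         for i, j in combinations(range(n), 2)}

    def avg(a, b):
        # UPGMA distance: unweighted mean over all leaf pairs across clusters.
        return sum(d[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    clusters = [frozenset([i]) for i in range(n)]
    merges, coph = [], {}
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        h = avg(a, b)
        for i in a:
            for j in b:
                coph[frozenset((i, j))] = h
        merges.append((a, b, h))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges, coph


def temporal_consistency(coph, first_seen):
    """Hypothetical consistency score (not the paper's definition).

    Takes the earliest sample as a reference and returns the fraction of
    remaining sample pairs where the sample closer to the reference in the
    tree was also seen earlier. Ties are skipped.
    """
    n = len(first_seen)
    ref = min(range(n), key=first_seen.__getitem__)
    others = [i for i in range(n) if i != ref]
    agree = total = 0
    for i, j in combinations(others, 2):
        di, dj = coph[frozenset((ref, i))], coph[frozenset((ref, j))]
        if di == dj or first_seen[i] == first_seen[j]:
            continue
        total += 1
        agree += (di < dj) == (first_seen[i] < first_seen[j])
    return agree / total if total else float("nan")


# Toy data: four samples drifting away from the first one over time.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
first_seen = [1, 2, 3, 4]  # stand-ins for first-seen timestamps

merges, coph = upgma(points)
score = temporal_consistency(coph, first_seen)
print(score)
```

The pairwise loops are quadratic or worse, so this is only suitable for small illustrations; a study at the scale the abstract describes would need an optimized clustering routine. Neighbor-Joining, the second method named in the abstract, produces an unrooted tree and would need a different distance-to-reference computation than the one sketched here.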
Problem

Research questions and friction points this paper is trying to address.

malware evolution
phylogenetic analysis
malware families
evolutionary relationships
proactive defense
Innovation

Methods, ideas, or system contributions that make the work stand out.

phylogenetic analysis
malware evolution
temporal validation
lineage-aware modeling
UPGMA