Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora
By: Edward Raff, Ryan R. Curtin, Derek Everett and more
Potential Business Impact:
Finds new computer viruses much faster.
A classifier using byte n-grams as features is the only approach we have found fast enough to meet requirements in size (sub 2 MB), speed (multiple GB/s), and latency (sub 10 ms) for deployment in numerous malware detection scenarios. We have consistently found that 6-8 grams achieve the best accuracy on our production deployments, but we have been unable to deploy regularly updated models due to the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the Zipfian distribution models the distribution of n-grams well, we exploit its properties to develop a new top-k n-gram extractor that is up to 35× faster than the previous best alternative. Using our new Zipf-Gramming algorithm, we are able to scale up our production training set and obtain up to a 30% improvement in AUC at detecting new malware. We show theoretically and empirically that our approach selects the top-k items with little error, and we describe the interplay between theory and engineering required to achieve these results.
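To make the underlying task concrete, the following is a minimal sketch of exact top-k byte n-gram counting, the kind of brute-force baseline whose cost grows with corpus size. It is not the Zipf-Gramming algorithm described in the abstract; the function name `top_k_byte_ngrams` and the in-memory `files` input are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_byte_ngrams(
    files: Iterable[bytes], n: int, k: int
) -> List[Tuple[bytes, int]]:
    """Return the k most frequent byte n-grams across all files.

    Hypothetical baseline: keeps an exact count for every distinct
    n-gram, so memory grows with the number of distinct n-grams seen.
    """
    counts: Counter = Counter()
    for data in files:
        # Slide a window of n bytes over the file contents.
        # If the file is shorter than n, the range is empty.
        for i in range(len(data) - n + 1):
            counts[data[i : i + n]] += 1
    return counts.most_common(k)


# Toy usage on two small in-memory "files".
example = top_k_byte_ngrams([b"abcabc", b"abcx"], n=3, k=1)
```

Because this sketch stores every distinct n-gram, it would not be practical for 6-8 grams over terabytes of executables, which is the cost the abstract says motivates a faster extractor.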
Similar Papers
Intermediate N-Gramming: Deterministic and Fast N-Grams For Large N and Large Datasets
Data Structures and Algorithms
Finds popular word patterns much faster.
Malware Detection based on API Calls: A Reproducibility Study
Cryptography and Security
Finds computer viruses by looking at how programs work.
Malware Classification from Memory Dumps Using Machine Learning, Transformers, and Large Language Models
Machine Learning (CS)
Finds bad computer programs faster and better.