LFreeDA: Label-Free Drift Adaptation for Windows Malware Detection
By: Adrian Shuai Li, Elisa Bertino
Potential Business Impact:
Teaches computers to spot new computer viruses automatically.
Machine learning (ML)-based malware detectors degrade over time as concept drift introduces new and evolving families unseen during training. Retraining is limited by the cost and time of manual labeling or sandbox analysis. Existing approaches mitigate this via drift detection and selective labeling, but fully label-free adaptation remains largely unexplored. Recent self-training methods use a previously trained model to generate pseudo-labels for unlabeled data and then train a new model on these labels. The unlabeled data are used only for inference and do not participate in training the earlier model. We argue that these unlabeled samples still carry valuable information that can be leveraged when incorporated appropriately into training. This paper introduces LFreeDA, an end-to-end framework that adapts malware classifiers to drift without manual labeling or drift detection. LFreeDA first performs unsupervised domain adaptation on malware images, jointly training on labeled and unlabeled samples to infer pseudo-labels and prune noisy ones. It then adapts a classifier on control-flow graph (CFG) representations using the labeled and selected pseudo-labeled data, leveraging the scalability of images for pseudo-labeling and the richer semantics of CFGs for final adaptation. Evaluations on the real-world MB-24+ dataset show that LFreeDA improves accuracy by up to 12.6% and F1 by up to 11.1% over no-adaptation lower bounds, and is only 4% and 3.4% below fully supervised upper bounds in accuracy and F1, respectively. It also matches the performance of state-of-the-art methods provided with ground truth labels for 300 target samples. Additional results on two controlled-drift benchmarks further confirm that LFreeDA maintains malware detection performance as malware evolves without human labeling.
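The pseudo-labeling and pruning step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how LFreeDA decides which pseudo-labels are noisy, so the confidence threshold used here is an assumption chosen for illustration, not the authors' criterion. The function names (`select_pseudo_labels`, `build_adaptation_set`) and the `threshold` parameter are likewise hypothetical.

```python
from typing import Any, List, Sequence, Tuple


def select_pseudo_labels(
    malware_probs: Sequence[float], threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Return (sample_index, pseudo_label) pairs for confident predictions.

    malware_probs[i] is a model's predicted probability that unlabeled
    sample i is malware (label 1) rather than benign (label 0).
    ASSUMPTION: a pseudo-label is treated as noisy, and pruned, when the
    model's confidence in the predicted class is below `threshold`.
    """
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0.5, 1.0]")
    selected = []
    for idx, p in enumerate(malware_probs):
        label = 1 if p >= 0.5 else 0
        # Confidence is the probability assigned to the predicted class.
        confidence = p if label == 1 else 1.0 - p
        if confidence >= threshold:
            selected.append((idx, label))
    return selected


def build_adaptation_set(
    labeled: Sequence[Tuple[Any, int]],
    unlabeled_features: Sequence[Any],
    malware_probs: Sequence[float],
    threshold: float = 0.9,
) -> List[Tuple[Any, int]]:
    """Merge labeled (feature, label) pairs with retained pseudo-labeled ones."""
    kept = select_pseudo_labels(malware_probs, threshold)
    pseudo = [(unlabeled_features[i], y) for i, y in kept]
    return list(labeled) + pseudo
```

In this sketch, the probabilities would come from the image-based model and the merged set would then be used to adapt the CFG-based classifier. Raising `threshold` keeps fewer but cleaner pseudo-labels, and lowering it keeps more samples at the cost of more label noise.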
Similar Papers
ADAPT: A Pseudo-labeling Approach to Combat Concept Drift in Malware Detection
Machine Learning (CS)
Finds new computer viruses faster and cheaper.
CITADEL: A Semi-Supervised Active Learning Framework for Malware Detection Under Continuous Distribution Drift
Cryptography and Security
Finds new phone viruses faster and cheaper.
LAMDA: A Longitudinal Android Malware Benchmark for Concept Drift Analysis
Cryptography and Security
Helps phone apps spot new viruses better.