Cyber Security Data Science: Machine Learning Methods and their Performance on Imbalanced Datasets
By: Mateo Lopez-Ledezma, Gissel Velarde
Potential Business Impact:
Helps detect rare computer threats by comparing classifiers and resampling techniques on each dataset.
Cybersecurity has become essential worldwide and at all levels, concerning individuals, institutions, and governments. A basic principle in cybersecurity is to always be alert. Therefore, automation is imperative in processes where the volume of daily operations is large. Several cybersecurity applications can be addressed as binary classification problems, including anomaly detection, fraud detection, intrusion detection, spam detection, and malware detection. We present three experiments. In the first experiment, we evaluate single classifiers including Random Forests, Light Gradient Boosting Machine, eXtreme Gradient Boosting, Logistic Regression, Decision Tree, and Gradient Boosting Decision Tree. In the second experiment, we test different sampling techniques including over-sampling, under-sampling, Synthetic Minority Over-sampling Technique, and Self-Paced Ensembling. In the last experiment, we evaluate Self-Paced Ensembling and its number of base classifiers. We found that imbalance learning techniques had both positive and negative effects, as reported in related studies, so these techniques should be applied with caution. We also found different best performers for each dataset. Therefore, we recommend testing single classifiers and imbalance learning techniques for each new dataset and application involving imbalanced datasets, as is the case in several cybersecurity applications.
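To make the sampling techniques named in the abstract concrete, the following is a minimal sketch of random over-sampling and random under-sampling using only the Python standard library. It is not the authors' implementation: the function names `random_over_sample` and `random_under_sample`, the toy data, and the 95/5 class split are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect feature rows under their class label."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    match the size of the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Draw with replacement to fill the gap up to the target size.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def random_under_sample(X, y, seed=0):
    """Keep a random subset of larger classes so all classes match the
    size of the smallest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Draw without replacement, discarding the remaining rows.
        for row in rng.sample(rows, target):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


# Toy imbalanced dataset (assumed): 5 "attack" rows (label 1), 95 "normal" rows (label 0).
X = [[i] for i in range(100)]
y = [1 if i < 5 else 0 for i in range(100)]

X_over, y_over = random_over_sample(X, y)
X_under, y_under = random_under_sample(X, y)

print("original:     ", dict(Counter(y)))
print("over-sampled: ", dict(Counter(y_over)))
print("under-sampled:", dict(Counter(y_under)))
```

In practice, resampling of this kind is applied only to the training split, so that the test split keeps the original class distribution. Because the abstract reports both positive and negative effects from such techniques, a sketch like this is best read as a starting point for per-dataset comparison rather than a recommended default.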
Similar Papers
Enhancing IoT Cyber Attack Detection in the Presence of Highly Imbalanced Data
Machine Learning (CS)
Finds hidden internet dangers in busy networks.
Performance of Machine Learning Classifiers for Anomaly Detection in Cyber Security Applications
Machine Learning (CS)
Finds fake credit card charges better.
Tree Boosting Methods for Balanced and Imbalanced Classification and their Robustness Over Time in Risk Assessment
Machine Learning (CS)
Helps computers find rare things in messy data.