Drift-Based Dataset Stability Benchmark
By: Dominik Soukup, Richard Plný, Daniel Vašata, and more
Machine learning (ML) is an efficient and popular approach to network traffic classification. However, network traffic classification is a challenging domain, and trained models may degrade soon after deployment due to obsolete datasets and the rapid evolution of computer networks as new or updated protocols appear. Moreover, a significant change in the behavior of a traffic type (and, therefore, in the underlying features representing that traffic) can cause a large and sudden performance drop in the deployed model, known as data or concept drift. In most cases, complete retraining is performed, often without further investigation of root causes, because good dataset quality is assumed. However, this is not always the case, and further investigation must be performed. This paper proposes a novel methodology for evaluating the stability of datasets and a benchmark workflow that can be used to compare datasets. The proposed framework is based on a concept drift detection method that also uses ML feature weights to boost detection performance. The benefits of this work are demonstrated on the CESNET-TLS-Year22 dataset. We provide an initial dataset stability benchmark that describes the dataset's stability and weak points and identifies the next steps for optimization. Lastly, using the proposed benchmarking methodology, we show the impact of optimization on the created dataset variants.
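To make the idea of feature-weighted drift detection concrete, the following is a minimal sketch under stated assumptions, not the paper's actual method: it combines per-feature two-sample Kolmogorov-Smirnov statistics between a reference window and a current window, weighting each feature by an ML-derived importance (e.g., from a trained classifier). The function name, the importances, and the 0.1 threshold are all hypothetical choices for illustration.

```python
# Illustrative sketch only: assumes drift is scored by weighting per-feature
# Kolmogorov-Smirnov statistics with ML feature importances. This is not the
# exact detection method used in the paper.
import numpy as np
from scipy.stats import ks_2samp

def weighted_drift_score(reference, current, feature_weights):
    """Combine per-feature KS statistics into one drift score.

    reference, current: arrays of shape (n_samples, n_features)
    feature_weights: non-negative importances, one per feature
    """
    weights = np.asarray(feature_weights, dtype=float)
    weights = weights / weights.sum()  # normalize importances to sum to 1
    stats = np.array([
        ks_2samp(reference[:, j], current[:, j]).statistic
        for j in range(reference.shape[1])
    ])
    return float(np.dot(weights, stats))  # weighted score in [0, 1]

# Hypothetical usage: compare a training window against a later window and
# flag drift when the weighted score exceeds an assumed threshold of 0.1.
rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 3))
cur = rng.normal(loc=[0.0, 0.5, 0.0], size=(1000, 3))  # feature 1 has shifted
importances = np.array([0.2, 0.7, 0.1])                # hypothetical weights
score = weighted_drift_score(ref, cur, importances)
print(f"weighted drift score: {score:.3f}",
      "-> drift" if score > 0.1 else "-> stable")
```

Weighting by feature importance means that drift in features the model barely relies on contributes little to the score, while drift in decisive features dominates it, which is one plausible reading of how feature weights can "boost" detection performance.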