Synthesizing Diverse Network Flow Datasets with Scalable Dynamic Multigraph Generation
By: Arya Grayeli, Vipin Swarup, Steven E. Noel
Potential Business Impact:
Creates fake computer network data for testing.
Obtaining real-world network datasets is often challenging because of privacy, security, and computational constraints. In the absence of such datasets, graph generative models become essential tools for creating synthetic datasets. In this paper, we introduce a novel machine learning model for generating high-fidelity synthetic network flow datasets that are representative of real-world networks. Our approach generates dynamic multigraphs, using a stochastic Kronecker graph generator for structure generation and a tabular generative adversarial network for feature generation. We further employ an XGBoost (eXtreme Gradient Boosting) model for graph alignment, ensuring accurate overlay of features onto the generated graph structure. We evaluate our model using new metrics that assess both the accuracy and diversity of the synthetic graphs. Our results demonstrate improvements in accuracy over previous large-scale graph generation methods while maintaining similar efficiency. We also explore the trade-off between accuracy and diversity in synthetic graph dataset creation, a topic not extensively covered in related work. Our contributions include the synthesis and evaluation of large real-world netflow datasets and the definition of new metrics for evaluating synthetic graph generative models.
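To make the structure-generation step concrete, here is a minimal sketch of stochastic Kronecker edge sampling in its recursive quadrant-descent (R-MAT-style) form. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_kronecker_edges`, the 2x2 initiator values, and the graph size are all hypothetical, and the feature generation (tabular GAN) and alignment (XGBoost) stages are omitted. Repeated edges are kept on purpose, since a multigraph allows several flows between the same pair of nodes.

```python
import random
from collections import Counter


def sample_kronecker_edges(initiator, levels, num_edges, seed=0):
    """Sample directed multigraph edges from a 2x2 stochastic Kronecker initiator.

    Each edge is placed by descending `levels` times into one of the four
    quadrants of the adjacency matrix, chosen with probability proportional
    to the matching initiator entry. Node ids fall in [0, 2**levels).
    """
    rng = random.Random(seed)
    cells = [(row, col) for row in range(2) for col in range(2)]
    weights = [initiator[row][col] for row, col in cells]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalise to a distribution

    edges = []
    for _edge in range(num_edges):
        src = dst = 0
        for _level in range(levels):
            row, col = rng.choices(cells, weights=weights, k=1)[0]
            src = 2 * src + row  # append one bit to the source id
            dst = 2 * dst + col  # append one bit to the destination id
        edges.append((src, dst))
    return edges


# Hypothetical initiator: a dense "core" quadrant and a sparse periphery.
initiator = [[0.9, 0.5], [0.5, 0.2]]
edges = sample_kronecker_edges(initiator, levels=4, num_edges=200, seed=42)

# Edge multiplicities: how many parallel edges each (src, dst) pair received.
multiplicity = Counter(edges)
```

In a full pipeline along the lines the abstract describes, each sampled edge would then receive flow features from a separate generative model, with an alignment step deciding which feature row is attached to which edge.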
Similar Papers
Constrained Diffusion Models for Synthesizing Representative Power Flow Datasets
Machine Learning (CS)
Creates fake power grid data for safer training.
Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation
Machine Learning (CS)
Makes medical AI work everywhere, fairly.
Boosting Statistic Learning with Synthetic Data from Pretrained Large Models
Machine Learning (Stat)
Makes computer models learn better with fake data.