SMART: A Surrogate Model for Predicting Application Runtime in Dragonfly Systems
By: Xin Wang, Pietro Lodi Rizzini, Sourav Medya and more
Potential Business Impact:
Predicts computer network slowdowns accurately.
The Dragonfly network, with its high-radix and low-diameter structure, is a leading interconnect in high-performance computing. A major challenge is workload interference on shared network links. Parallel discrete event simulation (PDES) is commonly used to analyze workload interference. However, high-fidelity PDES is computationally expensive, making it impractical for large-scale or real-time scenarios. Hybrid simulation that incorporates data-driven surrogate models offers a promising alternative, especially for forecasting application runtime, a task complicated by the dynamic behavior of network traffic. We present SMART, a surrogate model that combines graph neural networks (GNNs) and large language models (LLMs) to capture both spatial and temporal patterns from port-level router data. SMART outperforms existing statistical and machine learning baselines, enabling accurate runtime prediction and supporting efficient hybrid simulation of Dragonfly networks.
Similar Papers
Domain-Decomposed Graph Neural Network Surrogate Modeling for Ice Sheets
Machine Learning (CS)
Speeds up computer simulations of ice sheets.
On Approaches to Building Surrogate ODE Models for Diffusion Bridges
Machine Learning (CS)
Makes AI create images much faster and easier.
ESM: A Framework for Building Effective Surrogate Models for Hardware-Aware Neural Architecture Search
Machine Learning (CS)
Makes smart computer brains work faster on phones.