SPARS: A Reinforcement Learning-Enabled Simulator for Power Management in HPC Job Scheduling
By: Muhammad Alfian Amrizal , Raka Satya Prasasta , Santana Yuda Pradata and more
High-performance computing (HPC) clusters consume enormous amounts of energy, with idle nodes as a major source of waste. Powering down unused nodes can mitigate this problem, but poorly timed transitions introduce long delays and reduce overall performance. To address this trade-off, we present SPARS, a reinforcement learning-enabled simulator for power management in HPC job scheduling. SPARS integrates job scheduling and node power state management within a discrete-event simulation framework. It supports traditional scheduling policies such as First Come First Served and EASY Backfilling, along with enhanced variants that employ reinforcement learning agents to dynamically decide when nodes should be powered on or off. Users can configure workloads and platforms in JSON format, specifying job arrivals, execution times, node power models, and transition delays. The simulator records comprehensive metrics-including energy usage, wasted power, job waiting times, and node utilization-and provides Gantt chart visualizations to analyze scheduling dynamics and power transitions. Unlike widely used Batsim-based frameworks that rely on heavy inter-process communication, SPARS provides lightweight event handling and consistent simulation results, making experiments easier to reproduce and extend. Its modular design allows new scheduling heuristics or learning algorithms to be integrated with minimal effort. By providing a flexible, reproducible, and extensible platform, SPARS enables researchers and practitioners to systematically evaluate power-aware scheduling strategies, explore the trade-offs between energy efficiency and performance, and accelerate the development of sustainable HPC operations.
Similar Papers
SPARQ: An Optimization Framework for the Distribution of AI-Intensive Applications under Non-Linear Delay Constraints
Networking and Internet Architecture
Makes apps run faster by smarter resource use.
SparOA: Sparse and Operator-aware Hybrid Scheduling for Edge DNN Inference
Distributed, Parallel, and Cluster Computing
Makes smart devices run faster and use less power.
Prompt-Aware Scheduling for Low-Latency LLM Serving
Machine Learning (CS)
Makes AI answer questions much faster.