Predictive Modeling of I/O Performance for Machine Learning Training Pipelines: A Data-Driven Approach to Storage Optimization
By: Karthik Prabhakar
Potential Business Impact:
Helps computers learn faster by feeding them data quicker.
Modern machine learning training is increasingly bottlenecked by data I/O rather than compute: GPUs often run below 50% utilization while they wait for data. This paper presents a machine learning approach that predicts I/O performance and recommends storage configurations for ML training pipelines. We collected 141 observations through systematic benchmarking across storage backends (NVMe SSD, network-attached storage, in-memory filesystems), data formats, and access patterns, covering both low-level I/O operations and full training pipelines. Of the seven regression models and three classification approaches evaluated, XGBoost performed best, with an R-squared of 0.991 and I/O throughput predictions within 11.8% of measured values on average. Feature importance analysis showed that throughput metrics and batch size are the primary performance drivers. This data-driven approach can reduce configuration time from days of trial and error to minutes of predictive recommendation. The methodology is reproducible and extensible to other resource management problems in ML systems. Code and data are available at https://github.com/knkarthik01/gpu_storage_ml_project
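To make the two reported figures concrete, the sketch below shows how an R-squared value and an average percentage error could be computed for a set of throughput predictions. It is an illustration only, not the paper's evaluation code: the abstract does not name the exact error metric, so mean absolute percentage error is assumed here, and the function names and sample throughput values are hypothetical.

```python
from statistics import mean


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_abs_pct_error(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumed metric; every actual value must be nonzero.
    """
    return 100.0 * mean(abs(a - p) / abs(a) for a, p in zip(actual, predicted))


# Hypothetical throughput values in MB/s, invented for illustration only.
measured = [100.0, 200.0, 400.0, 800.0]
predicted = [110.0, 190.0, 420.0, 780.0]

print(f"R^2  = {r_squared(measured, predicted):.4f}")
print(f"MAPE = {mean_abs_pct_error(measured, predicted):.2f}%")
```

An R-squared near 1 means the model explains most of the variance in measured throughput, while the percentage error indicates how far a typical prediction lands from the measured value.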
Similar Papers
Artificial Intelligence for Cost-Aware Resource Prediction in Big Data Pipelines
Distributed, Parallel, and Cluster Computing
Saves money by guessing computer needs.
GPU Memory Requirement Prediction for Deep Learning Task Based on Bidirectional Gated Recurrent Unit Optimization Transformer
Machine Learning (CS)
Predicts computer memory needs for AI faster.
PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage
Databases
Makes computers analyze big data 3x faster.