Predictive Modeling of I/O Performance for Machine Learning Training Pipelines: A Data-Driven Approach to Storage Optimization

Published: December 7, 2025 | arXiv ID: 2512.06699v1

By: Karthik Prabhakar

Potential Business Impact:

Helps computers learn faster by feeding them data quicker.

Business Areas:

Predictive Analytics Artificial Intelligence, Data and Analytics, Software

Modern machine learning training is increasingly bottlenecked by data I/O rather than compute: GPUs often sit below 50% utilization while waiting for data. This paper presents a machine learning approach that predicts I/O performance and recommends optimal storage configurations for ML training pipelines. We collected 141 observations through systematic benchmarking across storage backends (NVMe SSD, network-attached storage, in-memory filesystems), data formats, and access patterns, covering both low-level I/O operations and full training pipelines. Of the seven regression models and three classification approaches evaluated, XGBoost performed best, with an R-squared of 0.991 and an average error of 11.8% in predicted I/O throughput. Feature importance analysis revealed that throughput metrics and batch size are the primary performance drivers. This data-driven approach can reduce configuration time from days of trial and error to minutes of predictive recommendation. The methodology is reproducible and extensible to other resource management problems in ML systems. Code and data are available at https://github.com/knkarthik01/gpu_storage_ml_project
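
The abstract reports two accuracy figures: R-squared and an average percentage error. As a rough illustration of how such figures are commonly computed, the sketch below assumes the percentage error is a mean absolute percentage error (MAPE), which the abstract does not state. The function names and the throughput values are hypothetical and are not taken from the paper or its repository.

```python
from statistics import mean


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    return 100.0 * mean(abs(a - p) / abs(a) for a, p in zip(actual, predicted))


# Hypothetical throughput measurements (MB/s) and model predictions;
# these numbers are invented for illustration only.
measured = [100.0, 200.0, 400.0, 800.0]
predicted = [110.0, 190.0, 420.0, 760.0]

print(f"R^2  = {r_squared(measured, predicted):.4f}")
print(f"MAPE = {mape(measured, predicted):.2f}%")
```

The two metrics answer different questions: R-squared measures how much of the variance in throughput the model explains, while a percentage error shows how far a typical prediction is from the measured value. This may be why the abstract reports both.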

Artificial Intelligence for Cost-Aware Resource Prediction in Big Data Pipelines

Distributed, Parallel, and Cluster Computing

Saves money by guessing computer needs.

30 Sep 2025


GPU Memory Requirement Prediction for Deep Learning Task Based on Bidirectional Gated Recurrent Unit Optimization Transformer

Machine Learning (CS)

Predicts computer memory needs for AI faster.

23 Oct 2025


PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage

Databases

Makes computers analyze big data 3x faster.

2 Dec 2025


Page Count

20 pages
