Deadline-Aware Online Scheduling for LLM Fine-Tuning with Spot Market Predictions
By: Linggao Kong, Yuedong Xu, Lei Jiao, et al.
As foundation models grow in size, fine-tuning them becomes increasingly expensive. While GPU spot instances offer a low-cost alternative to on-demand resources, their volatile prices and availability make deadline-aware scheduling particularly challenging. We tackle this difficulty by using a mix of spot and on-demand instances. Distinctively, we demonstrate the predictability of prices and availability in a spot instance market, the power of prediction in enabling cost-efficient scheduling, and the sensitivity of such scheduling to estimation errors. We formulate an integer programming problem that captures the use of mixed instances under both price and availability dynamics. We propose a prediction-based online allocation algorithm built on the committed horizon control approach, which leverages a \emph{commitment level} to enforce a partial sequence of decisions. For settings where predictions become inaccurate, we further present a complementary online algorithm that requires no predictions. We then develop an online policy selection algorithm that learns the best policy from a pool constructed by varying the parameters of both algorithms. We prove that the prediction-based algorithm achieves tighter performance bounds as prediction error decreases, while the policy selection algorithm attains a regret bound of $\mathcal{O}(\sqrt{T})$. Experimental results demonstrate that our online framework adaptively selects the best policy under varying spot market dynamics and prediction quality, consistently outperforming baselines and improving utility by up to 54.8\%.
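The abstract does not specify the policy selection rule, but an $\mathcal{O}(\sqrt{T})$ regret bound against the best fixed policy in a finite pool is characteristic of multiplicative-weights (Hedge) style online learning. The following is a minimal illustrative sketch of that generic approach, not the paper's actual algorithm; the function name, the reward-oracle interface, and the step size are all assumptions made for the example.

```python
import math
import random

def hedge_policy_selection(policies, T, rewards):
    """Illustrative multiplicative-weights (Hedge) policy selector.

    With rewards in [0, 1] and step size eta = sqrt(ln(N) / T), Hedge
    achieves regret O(sqrt(T * ln N)) against the best fixed policy in
    hindsight. `rewards(t, policies)` is a hypothetical full-information
    oracle returning the round-t reward of every policy in the pool.
    Returns the cumulative reward of the policies actually played.
    """
    n = len(policies)
    eta = math.sqrt(math.log(n) / T)  # standard Hedge step size
    weights = [1.0] * n
    total = 0.0
    for t in range(T):
        s = sum(weights)
        probs = [w / s for w in weights]
        # Sample one policy to play this round, proportionally to weight.
        i = random.choices(range(n), weights=probs)[0]
        r = rewards(t, policies)  # reward vector in [0, 1], one entry per policy
        total += r[i]
        # Exponentially up-weight policies that did well this round.
        weights = [w * math.exp(eta * ri) for w, ri in zip(weights, r)]
    return total
```

Under this sketch, the "pool" in the abstract would correspond to instantiations of the two online algorithms with different parameter settings (e.g. different commitment levels), and each round's reward would reflect the realized scheduling utility of each candidate policy.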