Non-Stationary Restless Multi-Armed Bandits with Provable Guarantee
By: Yu-Heng Hung, Ping-Chun Hsieh, Kai Wang
Potential Business Impact:
Helps computers learn when things change.
Online restless multi-armed bandits (RMABs) typically assume that each arm follows a stationary Markov Decision Process (MDP) with fixed state transitions and rewards. However, in real-world applications such as healthcare and recommendation systems, these assumptions often break down due to non-stationary dynamics, posing significant challenges for traditional RMAB algorithms. In this work, we specifically consider an $N$-armed RMAB with non-stationary transitions constrained by a bounded variation budget $B$. Our proposed \rmab\ algorithm integrates sliding-window reinforcement learning (RL) with an upper confidence bound (UCB) mechanism to simultaneously learn the transition dynamics and their variations. We further establish that \rmab\ achieves an $\widetilde{\mathcal{O}}(N^2 B^{\frac{1}{4}} T^{\frac{3}{4}})$ regret bound by leveraging a relaxed definition of regret, providing, for the first time, a foundational theoretical framework for non-stationary RMAB problems.
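As a rough illustration of the sliding-window-plus-UCB idea described in the abstract (not the authors' implementation), the sketch below estimates one arm's transition probabilities from only the most recent observations and attaches a confidence radius that shrinks with the windowed sample count. The class name, the window length `window`, and the bonus coefficient `bonus_coef` are hypothetical choices introduced here for illustration.

```python
import numpy as np
from collections import deque

class SlidingWindowTransitionUCB:
    """Illustrative sketch: sliding-window empirical transition estimate
    with a UCB-style confidence radius for a single arm modeled as an MDP.
    (Hypothetical helper, not the paper's algorithm.)"""

    def __init__(self, n_states, n_actions, window, bonus_coef=1.0):
        self.n_states = n_states
        self.n_actions = n_actions
        self.window = window            # W: only the most recent W transitions are kept
        self.bonus_coef = bonus_coef    # scales the confidence radius (assumed constant)
        self.history = deque(maxlen=window)  # recent (s, a, s') observations

    def update(self, s, a, s_next):
        """Record one observed transition of this arm; old data falls out of the window."""
        self.history.append((s, a, s_next))

    def estimate(self, s, a):
        """Return (p_hat, radius): the windowed empirical next-state distribution
        for (s, a) and an L1 confidence radius shrinking like 1/sqrt(count)."""
        counts = np.zeros(self.n_states)
        for (si, ai, sn) in self.history:
            if si == s and ai == a:
                counts[sn] += 1
        n = counts.sum()
        if n == 0:
            # No recent data for this (s, a): fall back to a uniform estimate
            # with the maximal L1 radius of 2.
            return np.full(self.n_states, 1.0 / self.n_states), 2.0
        p_hat = counts / n
        radius = self.bonus_coef * np.sqrt(self.n_states / n)
        return p_hat, min(radius, 2.0)
```

Under this sketch, an optimistic planner would pick, within the L1 ball of size `radius` around `p_hat`, the transition model most favorable to the learner before deciding which arms to pull; the sliding window is what lets the estimate track drift bounded by the variation budget $B$ over the horizon $T$.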
Similar Papers
MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment
Machine Learning (CS)
Lets smart systems learn from changing situations.
Model Predictive Control is almost Optimal for Heterogeneous Restless Multi-armed Bandits
Optimization and Control
Helps computers pick the best option faster.
Neural Index Policies for Restless Multi-Action Bandits with Heterogeneous Budgets
Machine Learning (CS)
Helps doctors choose best treatments with limited money.