
MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment

Published: November 12, 2025 | arXiv ID: 2511.09324v1

By: Mohsen Amiri, Konstantin Avrachenkov, Ibtihal El Mimouni, and more

Potential Business Impact:

Enables decision-making systems, such as recommenders, to keep learning and adapting as the underlying environment shifts over time.

Business Areas:
Machine Learning, Artificial Intelligence, Data and Analytics, Software

Restless Multi-Armed Bandits (RMABs) are powerful models for decision-making under uncertainty, yet classical formulations typically assume fixed dynamics, an assumption often violated in nonstationary environments. We introduce MARBLE (Multi-Armed Restless Bandits in a Latent Markovian Environment), which augments RMABs with a latent Markov state that induces nonstationary behavior. In MARBLE, each arm evolves according to a latent environment state that switches over time, making policy learning substantially more challenging. We further introduce the Markov-Averaged Indexability (MAI) criterion as a relaxed indexability assumption and prove that, under the MAI criterion, synchronous Q-learning with Whittle Indices (QWI) converges almost surely to the optimal Q-function and the corresponding Whittle indices despite the unobserved regime switches. We validate MARBLE on a calibrated simulator-embedded (digital twin) recommender system, where QWI consistently adapts to a shifting latent state and converges to an optimal policy, empirically corroborating our theoretical findings.
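The convergence result concerns a two-timescale Whittle-index Q-learning scheme. As a rough illustration only, the Python sketch below runs a standard synchronous QWI update on a single toy arm with fixed dynamics (no latent regime switching); the transition matrices, rewards, step sizes, and the update rule itself are assumptions chosen for illustration, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)

S = 3          # number of arm states (assumed)
gamma = 0.9    # discount factor (assumed)

# Assumed transition kernels P[a][s, s'] and rewards r[s, a] for one arm.
P = np.stack([
    np.array([[0.7, 0.2, 0.1],    # passive action (a = 0)
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.6]]),
    np.array([[0.4, 0.4, 0.2],    # active action (a = 1)
              [0.2, 0.4, 0.4],
              [0.1, 0.2, 0.7]]),
])
r = np.array([[0.0, 0.2],
              [0.0, 0.6],
              [0.0, 1.0]])

# One Q-table per reference state s_ref (Q[s_ref, s, a]) plus one index each.
Q = np.zeros((S, S, 2))
whittle = np.zeros(S)

for n in range(1, 20001):
    alpha = 0.5 / (1.0 + n) ** 0.6   # fast timescale: Q-function update
    beta = 0.5 / (1.0 + n) ** 0.9    # slow timescale: index update

    for s_ref in range(S):
        for s in range(S):
            for a in range(2):
                # Synchronous update: every (s, a) pair gets a sampled next state.
                s_next = rng.choice(S, p=P[a][s])
                # The passive action earns the current subsidy whittle[s_ref].
                target = (r[s, a] + (1 - a) * whittle[s_ref]
                          + gamma * Q[s_ref, s_next].max())
                Q[s_ref, s, a] += alpha * (target - Q[s_ref, s, a])

        # Push the subsidy toward indifference at the reference state,
        # i.e. Q(s_ref, 1) = Q(s_ref, 0), which characterizes the Whittle index.
        whittle[s_ref] += beta * (Q[s_ref, s_ref, 1] - Q[s_ref, s_ref, 0])

print("estimated Whittle indices per state:", np.round(whittle, 3))

In MARBLE, the arm's kernels would themselves switch with an unobserved Markov environment state; the MAI criterion and the almost-sure convergence result in the paper address exactly that harder setting.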

Page Count
7 pages

Category
Computer Science:
Machine Learning (CS)