Information-directed sampling for bandits: a primer
By: Annika Hirling, Giorgio Nicoletti, Antonio Celani
The multi-armed bandit problem provides a fundamental framework for analyzing the tension between exploration and exploitation in sequential learning. This paper explores Information-Directed Sampling (IDS) policies, a class of heuristics that balance immediate regret against information gain. We focus on the tractable environment of two-state Bernoulli bandits as a minimal model in which heuristic strategies can be rigorously compared against the optimal policy. We extend the IDS framework to the discounted infinite-horizon setting by introducing a modified information measure and a tuning parameter that modulates the decision-making behavior. We examine two specific problem classes: symmetric bandits and bandits with one fair coin. In the symmetric case we show that IDS achieves bounded cumulative regret, whereas in the one-fair-coin scenario the IDS policy yields a regret that scales logarithmically with the horizon, in agreement with classical asymptotic lower bounds. This work serves as a pedagogical synthesis, aiming to bridge concepts from reinforcement learning and information theory for an audience of statistical physicists.
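To make the regret-versus-information trade-off concrete, here is a minimal Python sketch of deterministic IDS for a two-hypothesis Bernoulli bandit, using the standard information ratio of Russo and Van Roy: each arm's expected squared regret divided by the mutual information its outcome carries about the hidden hypothesis. This is an illustrative sketch under assumed conventions (the environment, variable names, and the simple argmin rule are not taken from the paper, which moreover uses a modified information measure for the discounted setting).

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) outcome, with 0 log 0 = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def ids_action(belief, mu):
    """Deterministic IDS for a finite-hypothesis Bernoulli bandit.

    belief : array of shape (n_hyp,), posterior over hypotheses.
    mu     : array of shape (n_hyp, n_arms), mu[h, a] = mean reward
             of arm a under hypothesis h.
    Returns the arm minimizing the information ratio Delta(a)^2 / g(a).
    """
    # Bayesian regret of each arm: shortfall relative to the best arm
    # under each hypothesis, averaged over the posterior belief.
    best = mu.max(axis=1)                      # best mean per hypothesis
    delta = belief @ (best[:, None] - mu)      # shape (n_arms,)
    # Information gain of each arm: mutual information between its
    # Bernoulli outcome Y_a and the hypothesis H,
    # I(H; Y_a) = H(E[Y_a]) - E_h[ H(Y_a | h) ].
    marginal = belief @ mu                     # P(Y_a = 1), shape (n_arms,)
    gain = binary_entropy(marginal) - belief @ binary_entropy(mu)
    # Information ratio: a zero-regret arm is always acceptable, while an
    # uninformative arm with positive regret never is.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(gain > 0, delta**2 / gain,
                         np.where(delta == 0, 0.0, np.inf))
    return int(np.argmin(ratio))

# Example: arm h pays 0.7 under hypothesis h and 0.3 otherwise, with the
# posterior currently leaning toward hypothesis 1; IDS picks arm 1, since
# both arms are equally informative but arm 1 has lower expected regret.
mu = np.array([[0.7, 0.3],
               [0.3, 0.7]])
print(ids_action(np.array([0.3, 0.7]), mu))
```

The deterministic argmin above is the simplest variant; the randomized form of IDS minimizes the same ratio over probability distributions on arms rather than over single arms.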