Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration
By: Youngmin Oh, Jinje Park, Taejin Paik, and more
Potential Business Impact:
Helps computers learn better by guessing wisely.
In this paper, we address the contextual dueling bandit problem by proposing variance-aware algorithms that leverage neural networks to approximate nonlinear utility functions. Our approach employs a variance-aware exploration strategy, which adaptively accounts for uncertainty in pairwise comparisons while relying only on the gradients with respect to the learnable parameters of the last layer. This design effectively balances the exploration--exploitation tradeoff under both the Upper Confidence Bound (UCB) and Thompson Sampling (TS) frameworks. As a result, under standard assumptions, we establish theoretical guarantees showing that our algorithms achieve sublinear cumulative regret of order $\tilde{O}\left(d \sqrt{\sum_{t=1}^T \sigma_t^2} + \sqrt{dT}\right)$ for sufficiently wide neural networks, where $d$ is the contextual dimension, $\sigma_t^2$ the variance of comparisons at round $t$, and $T$ the total number of rounds. We also empirically validate that our approach offers reasonable computational efficiency and achieves sublinear regret on both synthetic tasks with nonlinear utilities and real-world tasks, outperforming existing methods.
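The abstract's core mechanism, deep representation with shallow (last-layer) variance-aware exploration under UCB, can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed random hidden layer standing in for a pretrained network, the inverse-variance weighting of comparisons, and the class name `VarianceAwareDuelingUCB` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed hidden layer standing in for a pretrained deep network; its output
# plays the role of the last-layer gradient features (shallow exploration).
D_IN, D_HID = 5, 32
W_hid = rng.normal(size=(D_HID, D_IN)) / np.sqrt(D_IN)

def phi(x):
    """Last-layer features: ReLU hidden layer, held fixed (assumption)."""
    return np.maximum(W_hid @ x, 0.0)

class VarianceAwareDuelingUCB:
    """Sketch of variance-aware UCB over pairwise utility differences."""
    def __init__(self, dim, lam=1.0, beta=1.0):
        self.A = lam * np.eye(dim)   # regularized Gram matrix over differences
        self.b = np.zeros(dim)
        self.beta = beta             # exploration-bonus scale

    def theta(self):
        # Ridge estimate of the last-layer utility parameters.
        return np.linalg.solve(self.A, self.b)

    def select(self, feats):
        """Pick an optimistic incumbent and its most uncertain challenger."""
        th = self.theta()
        A_inv = np.linalg.inv(self.A)
        scores = [f @ th + self.beta * np.sqrt(f @ A_inv @ f) for f in feats]
        i = int(np.argmax(scores))
        diffs = [feats[j] - feats[i] for j in range(len(feats))]
        uncert = [np.sqrt(d @ A_inv @ d) for d in diffs]
        j = int(np.argmax(uncert))
        return i, j

    def update(self, f_i, f_j, y, sigma2):
        """Variance-aware update: low-variance comparisons count more."""
        z = f_i - f_j
        w = 1.0 / max(sigma2, 1e-3)      # inverse-variance weight (assumption)
        self.A += w * np.outer(z, z)
        self.b += w * (y - 0.5) * z      # centered binary preference signal
```

A round then consists of mapping the candidate contexts through `phi`, calling `select` to obtain the dueling pair, observing the binary comparison outcome `y`, and calling `update` with an estimate of that round's comparison variance $\sigma_t^2$.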
Similar Papers
Neural Logistic Bandits
Machine Learning (CS)
Helps computers learn better with fewer guesses.
Variance-Dependent Regret Lower Bounds for Contextual Bandits
Machine Learning (CS)
Helps computers learn better with less guessing.
Neural Contextual Bandits Under Delayed Feedback Constraints
Machine Learning (CS)
Helps computers learn from delayed results.