Neural Contextual Bandits Under Delayed Feedback Constraints
By: Mohammadali Moghimi, Sharu Theresa Jose, Shana Moothedath
Potential Business Impact:
Helps computers learn from delayed results.
This paper presents a new algorithm for neural contextual bandits (CBs) that addresses the challenge of delayed reward feedback, where the reward for a chosen action is revealed only after a random, unknown delay. This scenario is common in applications such as online recommendation systems and clinical trials, where the outcome of an action, such as a user's response to a recommendation or a patient's response to a treatment, takes time to manifest and be measured. The proposed algorithm, called Delayed NeuralUCB, uses an upper confidence bound (UCB)-based exploration strategy. Under the assumption of independent and identically distributed sub-exponential reward delays, we derive an upper bound on the cumulative regret over a horizon of length T. We further consider a variant of the algorithm, called Delayed NeuralTS, that uses Thompson Sampling-based exploration. Numerical experiments on real-world datasets, such as MNIST and Mushroom, along with comparisons to benchmark approaches, demonstrate that the proposed algorithms effectively handle varying delays and are well suited to complex real-world scenarios.
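Since the abstract describes the mechanics only at a high level, the toy loop below illustrates the general shape of a Delayed-NeuralUCB-style learner: UCB scores built from a neural reward estimate plus a confidence width over the network's gradient features, and a pending queue that releases each reward only after its delay elapses. This is a minimal sketch under stated assumptions, not the authors' implementation: the synthetic linear-reward environment, the MLP size, the hyperparameters `lam` and `nu`, and the geometric delays (standing in for the paper's sub-exponential delays) are all illustrative.

```python
# Minimal sketch of a Delayed-NeuralUCB-style loop (illustrative, not the paper's code).
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
rng = np.random.default_rng(0)

d, K, T = 8, 4, 200            # context dimension, number of arms, horizon
lam, nu = 1.0, 1.0             # ridge regularizer and exploration scale (assumed values)

net = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
p = sum(q.numel() for q in net.parameters())
Z_inv = torch.eye(p) / lam     # inverse design matrix over gradient features
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

theta_star = rng.normal(size=(K, d))   # hidden reward parameters of the toy environment
pending, revealed = [], []             # (reveal_round, context, reward); delivered pairs

def grad_features(x):
    """Flattened network gradient g(x; theta), the usual NeuralUCB feature map."""
    net.zero_grad()
    net(torch.as_tensor(x, dtype=torch.float32)).backward()
    return torch.cat([q.grad.reshape(-1) for q in net.parameters()])

for t in range(T):
    contexts = rng.normal(size=(K, d))

    # Score each arm: predicted reward plus a UCB width from the gradient-feature ellipsoid.
    feats, scores = [], []
    for a in range(K):
        g = grad_features(contexts[a])
        with torch.no_grad():
            mu = net(torch.as_tensor(contexts[a], dtype=torch.float32)).item()
        width = nu * torch.sqrt(torch.clamp(g @ Z_inv @ g, min=0.0)).item()
        feats.append(g)
        scores.append(mu + width)
    a = int(np.argmax(scores))

    # Rank-one (Sherman-Morrison) update of Z_inv with the chosen arm's features.
    g = feats[a]
    Zg = Z_inv @ g
    Z_inv = Z_inv - torch.outer(Zg, Zg) / (1.0 + (g @ Zg).item())

    # The reward is generated now but queued until its random delay elapses;
    # geometric delays stand in here for the paper's sub-exponential delays.
    r = float(theta_star[a] @ contexts[a] + 0.1 * rng.normal())
    pending.append((t + int(rng.geometric(0.2)), contexts[a], r))

    # Release rewards whose delay has elapsed, then fit the network on revealed data.
    revealed += [(x, r) for (s, x, r) in pending if s <= t]
    pending = [(s, x, r) for (s, x, r) in pending if s > t]
    for x, r in revealed[-32:]:        # a few SGD steps on recently revealed pairs
        opt.zero_grad()
        loss = (net(torch.as_tensor(x, dtype=torch.float32)) - r) ** 2
        loss.backward()
        opt.step()
```

The delayed-feedback ingredient is the `pending` queue: the confidence matrix is updated immediately at selection time, while the network is fit only on rewards that have actually arrived. A Delayed NeuralTS variant would roughly replace the `mu + width` score with a sample drawn from a Gaussian centered at `mu` with standard deviation proportional to `width`.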
Similar Papers
Multi-User Contextual Cascading Bandits for Personalized Recommendation
Machine Learning (CS)
Shows ads better to many people at once.
Decentralized Contextual Bandits with Network Adaptivity
Machine Learning (CS)
Helps many computers learn together faster.