Group-Sensitive Offline Contextual Bandits
By: Yihong Guo, Junjie Luo, Guodong Gao, and more
Potential Business Impact:
Helps make sure different groups get a fairer share of the benefits.
Offline contextual bandits allow one to learn policies from historical (offline) data without requiring online interaction. However, offline policy optimization that maximizes the overall expected reward can unintentionally amplify reward disparities across groups. As a result, some groups may benefit more than others from the learned policy, raising fairness concerns, especially when resources are limited. In this paper, we study a group-sensitive fairness constraint in offline contextual bandits, reducing the group-wise reward disparities that may arise during policy learning. We address two common parity requirements: the reward disparity is constrained within a user-defined threshold, or the reward disparity is minimized during policy optimization. We propose a constrained offline policy optimization framework that introduces group-wise reward disparity constraints into an off-policy gradient-based optimization procedure. To improve the estimation of the group-wise reward disparity during training, we employ a doubly robust estimator and further provide a convergence guarantee for the policy optimization. Empirical results on synthetic and real-world datasets demonstrate that our method effectively reduces reward disparities while maintaining competitive overall performance.
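To make the general idea concrete, the sketch below shows one way such a group-sensitive objective could look in practice: a softmax policy trained by off-policy gradient ascent on a doubly robust value estimate, with a soft penalty whenever the estimated gap between two group-wise values exceeds a user-defined threshold. This is a minimal illustrative sketch, not the paper's implementation; the toy data, the linear reward model, the penalty form, and all names (e.g. `dr_group_values`, the threshold `eps`) are assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logged dataset (hypothetical): contexts X, logged actions A, rewards R,
# behavior propensities, and a binary group label G per context.
n, d, n_actions = 2000, 5, 3
X = rng.normal(size=(n, d))
G = (X[:, 0] > 0).astype(int)                       # assumed group label
logged_probs = np.full((n, n_actions), 1.0 / n_actions)
A = rng.integers(0, n_actions, size=n)              # uniform logging policy
true_means = X @ rng.normal(size=(d, n_actions)) + 0.5 * G[:, None]
R = true_means[np.arange(n), A] + rng.normal(scale=0.1, size=n)

# Simple per-action least-squares reward model, used as the direct-method
# component inside the doubly robust estimate.
W = np.stack([np.linalg.lstsq(X[A == a], R[A == a], rcond=None)[0]
              for a in range(n_actions)], axis=1)   # shape (d, n_actions)
Q_hat = X @ W                                       # predicted reward per action

def softmax_policy(theta, X):
    """pi_theta(a|x) as a linear softmax over contexts."""
    logits = X @ theta
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def dr_group_values(theta):
    """Doubly robust estimate of the policy value within each group."""
    pi = softmax_policy(theta, X)
    dm = (pi * Q_hat).sum(axis=1)                              # direct-method term
    iw = pi[np.arange(n), A] / logged_probs[np.arange(n), A]   # importance weights
    dr = dm + iw * (R - Q_hat[np.arange(n), A])                # per-sample DR value
    return np.array([dr[G == g].mean() for g in (0, 1)]), dr

# Penalized objective: overall DR value minus lam * (group gap beyond eps).
# Optimized with a crude finite-difference gradient ascent, purely for
# illustration of the constrained training loop.
lam, eps, lr = 5.0, 0.05, 0.5
theta = np.zeros((d, n_actions))

def objective(theta):
    (v0, v1), dr = dr_group_values(theta)
    return dr.mean() - lam * max(abs(v0 - v1) - eps, 0.0)

for step in range(200):
    base = objective(theta)
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):             # numerical gradient per entry
        pert = theta.copy()
        pert[idx] += 1e-3
        grad[idx] = (objective(pert) - base) / 1e-3
    theta += lr * grad

(v0, v1), dr = dr_group_values(theta)
print(f"overall DR value {dr.mean():.3f}, group gap {abs(v0 - v1):.3f}")
```

The doubly robust term combines a fitted reward model with importance weighting, so the group-wise value estimates remain usable when either the reward model or the logged propensities are imperfect; the abstract cites this as the reason for using a doubly robust estimator when tracking the group-wise disparity during training.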
Similar Papers
Multi-Armed Bandits with Minimum Aggregated Revenue Constraints
Machine Learning (CS)
Helps websites show you the best ads.
Demystifying Online Clustering of Bandits: Enhanced Exploration Under Stochastic and Smoothed Adversarial Contexts
Machine Learning (CS)
Helps computers learn faster by grouping similar users.
Learning Peer Influence Probabilities with Linear Contextual Bandits
Machine Learning (CS)
Helps spread good ideas faster online.