RewardRank: Optimizing True Learning-to-Rank Utility
By: Gaurav Bhatt, Kiran Koshy Thekumparampil, Tanmay Gangwani and more
Potential Business Impact:
Shows online stores what shoppers really want.
Traditional ranking systems rely on proxy loss functions that assume simplistic user behavior, such as users preferring a ranked list in which items are sorted by hand-crafted relevance. However, real-world user interactions are influenced by complex behavioral biases, including position bias, brand affinity, decoy effects, and similarity aversion, which these objectives fail to capture. As a result, models trained on such losses are often misaligned with actual user utility, such as the probability of any click or purchase across the ranked list. In this work, we propose a data-driven framework for modeling user behavior through counterfactual reward learning. Our method, RewardRank, first trains a deep utility model to estimate user engagement for entire item permutations using logged data. Then, a ranking policy is optimized to maximize predicted utility via differentiable soft permutation operators, enabling end-to-end training over the space of factual and counterfactual rankings. To address the challenge of evaluation without ground truth for unseen permutations, we introduce two automated protocols: (i) KD-Eval, which uses a position-aware oracle for counterfactual reward estimation, and (ii) LLM-Eval, which simulates user preferences via large language models. Experiments on large-scale benchmarks, including Baidu-ULTR and the Amazon KDD Cup datasets, demonstrate that our approach consistently outperforms strong baselines, highlighting the effectiveness of modeling user behavior dynamics for utility-optimized ranking. Our code is available at: https://github.com/GauravBh1010tt/RewardRank
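To make the idea of a "differentiable soft permutation operator" concrete, here is a minimal sketch in plain Python. The abstract does not say which operator RewardRank uses, so this example assumes a SoftSort-style relaxation (a row-wise softmax over distances to the sorted scores). The function `toy_utility`, the item values, and the position weights are hypothetical stand-ins for the learned utility model and are not taken from the paper.

```python
import math


def soft_sort(scores, tau=1.0):
    """SoftSort-style relaxation (an assumed choice, not confirmed by the paper).

    Row i is a softmax over items that peaks at the item holding the
    i-th largest score. The result is a row-stochastic n x n matrix that
    approaches a hard permutation matrix as tau goes to 0.
    """
    ordered = sorted(scores, reverse=True)
    matrix = []
    for target in ordered:
        logits = [-abs(target - s) / tau for s in scores]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


def toy_utility(soft_perm, item_values, position_weights):
    """Hypothetical stand-in for a learned utility model.

    Computes the position-weighted expected item value under the
    soft permutation: sum_i w_i * sum_j P[i][j] * v_j.
    """
    return sum(
        w * sum(p * v for p, v in zip(row, item_values))
        for w, row in zip(position_weights, soft_perm)
    )


# Illustrative inputs (all values are made up for this sketch).
scores = [0.2, 2.5, 1.0]            # ranking-policy scores, one per item
item_values = [0.1, 0.9, 0.5]       # assumed per-item engagement value
position_weights = [1.0, 0.5, 0.25]  # assumed position-bias weights

P = soft_sort(scores, tau=0.1)
# Recover the hard ranking implied by the soft permutation.
hard_order = [max(range(len(row)), key=row.__getitem__) for row in P]
utility = toy_utility(P, item_values, position_weights)
print(hard_order, round(utility, 4))
```

Because every step is a smooth function of `scores`, the same computation written in an automatic-differentiation framework would let gradients of the predicted utility flow back into the ranking policy, which is the property the abstract relies on for end-to-end training.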
Similar Papers
Addressing Personalized Bias for Unbiased Learning to Rank
Information Retrieval
Helps search engines show better results for everyone.
Deep Reinforcement Learning for Ranking Utility Tuning in the Ad Recommender System at Pinterest
Machine Learning (CS)
Makes online ads show better for you.
LoRe: Personalizing LLMs via Low-Rank Reward Modeling
Machine Learning (CS)
Teaches AI to learn what you like.