On the Reliability of Sampling Strategies in Offline Recommender Evaluation
By: Bruno L. Pereira, Alan Said, Rodrygo L. T. Santos
Potential Business Impact:
Makes offline tests of recommendation systems more trustworthy and accurate.
Offline evaluation plays a central role in benchmarking recommender systems when online testing is impractical or risky. However, it is susceptible to two key sources of bias: exposure bias, where users only interact with items they are shown, and sampling bias, introduced when evaluation is performed on a subset of logged items rather than the full catalog. While prior work has proposed methods to mitigate sampling bias, these are typically assessed on fixed logged datasets rather than for their ability to support reliable model comparisons under varying exposure conditions or relative to true user preferences. In this paper, we investigate how different combinations of logging and sampling choices affect the reliability of offline evaluation. Using a fully observed dataset as ground truth, we systematically simulate diverse exposure biases and assess the reliability of common sampling strategies along four dimensions: sampling resolution (recommender model separability), fidelity (agreement with full evaluation), robustness (stability under exposure bias), and predictive power (alignment with ground truth). Our findings highlight when and how sampling distorts evaluation outcomes and offer practical guidance for selecting strategies that yield faithful and robust offline comparisons.
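As a minimal illustration of the sampling bias the abstract refers to, the sketch below contrasts full-catalog Hit Rate@k with the common "rank the held-out item against N sampled negatives" protocol on purely synthetic scores. All names, sizes, and numbers here are hypothetical and not taken from the paper; this is not the authors' evaluation code, only a toy demonstration of the two protocols, assuming one held-out relevant item per user.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy setup: random model scores and one held-out relevant item per user.
n_users, n_items, k = 100, 1000, 10
scores = rng.normal(size=(n_users, n_items))        # model score for every user-item pair
true_item = rng.integers(0, n_items, size=n_users)  # held-out relevant item per user

def hit_rate_full(scores, true_item, k):
    """Full-catalog evaluation: rank the held-out item against every item."""
    pos_scores = scores[np.arange(len(true_item)), true_item][:, None]
    ranks = (scores > pos_scores).sum(axis=1)        # items scoring strictly higher
    return float((ranks < k).mean())

def hit_rate_sampled(scores, true_item, k, n_negatives=99, rng=rng):
    """Sampled evaluation: rank the held-out item against n_negatives random items."""
    hits = 0
    for u, pos in enumerate(true_item):
        candidates = np.delete(np.arange(scores.shape[1]), pos)
        negatives = rng.choice(candidates, size=n_negatives, replace=False)
        candidate_scores = scores[u, np.concatenate(([pos], negatives))]
        rank = (candidate_scores > candidate_scores[0]).sum()
        hits += rank < k
    return hits / len(true_item)

print("HR@10 (full catalog):", hit_rate_full(scores, true_item, k))
print("HR@10 (99 sampled negatives):", hit_rate_sampled(scores, true_item, k))
```

On real data, the sampled variant typically reports much higher values than the full-catalog metric and can even change the relative ordering of models, which is the kind of distortion the paper's fidelity and predictive-power analyses are designed to expose.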
Similar Papers
Measuring the stability and plasticity of recommender systems
Information Retrieval
Tests how well movie suggestions learn new trends.
Algorithm Adaptation Bias in Recommendation System Online Experiments
Information Retrieval
Fixes online tests to show what really works.