Offline Contextual Bandit with Counterfactual Sample Identification
By: Alexandre Gilotte, Otmane Sakhi, Imad Aouali and more
Potential Business Impact:
Finds better choices by comparing what happened with what could have happened.
In production systems, contextual bandit approaches often rely on direct reward models that take both action and context as input. However, these models can suffer from confounding, making it difficult to isolate the effect of the action from that of the context. We present Counterfactual Sample Identification, a new approach that reframes the problem: rather than predicting reward, it learns to recognize which action led to a successful (binary) outcome by comparing it to a counterfactual action sampled from the logging policy under the same context. The method is theoretically grounded and consistently outperforms direct reward models in both synthetic experiments and real-world deployments.
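To make the mechanism concrete, here is a minimal NumPy sketch of the idea the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the softmax logging policy, the linear per-action featurization (`features`), the synthetic reward model (`beta`, `q`), and the pairwise logistic objective are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic setup; all dimensions and names are assumptions.
n, d, k = 5000, 8, 5            # logged samples, context dim, actions
X = rng.normal(size=(n, d))     # contexts
theta = rng.normal(size=(d, k)) # logging-policy parameters

def logging_policy(x):
    """Assumed softmax logging policy pi0(a | x)."""
    logits = x @ theta
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return p / p.sum(axis=-1, keepdims=True)

p0 = logging_policy(X)
A = np.array([rng.choice(k, p=p0[i]) for i in range(n)])  # logged actions
beta = rng.normal(size=(d, k))           # hidden reward parameters
q = 1 / (1 + np.exp(-(X @ beta)))        # true P(r=1 | x, a) for every action
R = rng.binomial(1, q[np.arange(n), A])  # observed binary rewards

def features(x, a):
    """Assumed featurization: context vector placed in the action's slot."""
    phi = np.zeros((x.shape[0], d * k))
    for j in range(x.shape[0]):
        phi[j, a[j] * d:(a[j] + 1) * d] = x[j]
    return phi

# Counterfactual Sample Identification, as described in the abstract:
# for each *successful* sample, draw a counterfactual action from the
# logging policy under the same context, then train a scorer to tell
# the logged (successful) action apart from the counterfactual one.
pos = np.where(R == 1)[0]
A_cf = np.array([rng.choice(k, p=p0[i]) for i in pos])
diff = features(X[pos], A[pos]) - features(X[pos], A_cf)  # zero when a' == a

w = np.zeros(d * k)
lr = 0.5
for _ in range(300):
    # Pairwise logistic loss: log(1 + exp(-(f(x, a) - f(x, a')))).
    margin = diff @ w
    grad = -((1 - 1 / (1 + np.exp(-margin)))[:, None] * diff).mean(axis=0)
    w -= lr * grad

def greedy_policy(x):
    """Act by picking the action the identifier scores highest."""
    scores = [features(x, np.full(x.shape[0], a, dtype=int)) @ w for a in range(k)]
    return np.stack(scores, axis=1).argmax(axis=1)

A_new = greedy_policy(X)
print(f"logged reward rate {R.mean():.3f} vs "
      f"learned policy value {q[np.arange(n), A_new].mean():.3f}")
```

One property worth noting in this sketch: because both the successful action and its counterfactual are drawn under the same logging policy and context, the logging propensities appear on both sides of the comparison and cancel in the pairwise objective, so ranking actions by the learned score can target P(success | context, action) without a direct reward model. The paper's own analysis is what grounds this claim in general; the sketch only illustrates it in a toy linear setting.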
Similar Papers
Counterfactual Inference under Thompson Sampling
Information Retrieval
Helps websites show you better stuff.
Abstract Counterfactuals for Language Model Agents
Machine Learning (CS)
Helps AI understand "what if" questions better.
Multiplayer Information Asymmetric Contextual Bandits
Machine Learning (CS)
Helps multiple players learn best actions together.