Model-Based Reinforcement Learning Under Confounding
By: Nishanth Venkatesh, Andreas A. Malikopoulos
Potential Business Impact:
Lets computers learn good decisions from past data even when key information was never recorded.
We investigate model-based reinforcement learning in contextual Markov decision processes (C-MDPs) in which the context is unobserved and induces confounding in the offline dataset. In such settings, conventional model-learning methods are fundamentally inconsistent, as the transition and reward mechanisms generated under a behavioral policy do not correspond to the interventional quantities required for evaluating a state-based policy. To address this issue, we adapt a proximal off-policy evaluation approach that identifies the confounded reward expectation using only observable state-action-reward trajectories under mild invertibility conditions on proxy variables. When combined with a behavior-averaged transition model, this construction yields a surrogate MDP whose Bellman operator is well defined and consistent for state-based policies, and which integrates seamlessly with the maximum causal entropy (MaxCausalEnt) model-learning framework. The proposed formulation enables principled model learning and planning in confounded environments where contextual information is unobserved, unavailable, or impractical to collect.
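To make the surrogate construction concrete, the display below is a minimal sketch in our own notation rather than the paper's definitions; the symbols $\bar{r}$, $\bar{P}$, and the discount factor $\gamma$ are illustrative assumptions. For a state-based policy $\pi$, the surrogate MDP's Bellman operator would act as

\[
(\mathcal{T}^{\pi} V)(s) \;=\; \sum_{a} \pi(a \mid s)\left[\, \bar{r}(s,a) \;+\; \gamma \sum_{s'} \bar{P}(s' \mid s, a)\, V(s') \right],
\]

where $\bar{P}(s' \mid s, a)$ denotes the behavior-averaged transition model estimated directly from the offline trajectories, and $\bar{r}(s,a)$ denotes the reward expectation identified through the proximal (proxy-variable) adjustment rather than the naive conditional mean $\mathbb{E}[R \mid s, a]$, which would be biased by the hidden context.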
Similar Papers
Confounding Robust Deep Reinforcement Learning: A Causal Approach
Artificial Intelligence
Makes AI learn safely from bad past game data.
Towards Causal Model-Based Policy Optimization
Machine Learning (CS)
Teaches computers to make better choices when things change.
Efficient Solution and Learning of Robust Factored MDPs
Machine Learning (CS)
Makes AI learn safe actions with fewer tries.