Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes
By: Alessandro Trapasso, Luca Iocchi, Fabio Patrizi
Many practical decision-making problems involve tasks whose success depends on the entire system history rather than on reaching a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not suitable for such tasks, whereas RL over non-Markovian reward decision processes (NMRDPs) lets agents tackle tasks with temporal dependencies. However, NMRDP approaches have long lacked formal guarantees of both (near-)optimality and sample efficiency. We address both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that separates Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to $\varepsilon$-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, which uses a SimHash-based discretiser that preserves the same factorized structure and achieves fast, stable learning without manual gridding or function approximation. We experimentally compare our method with state-of-the-art model-based RL approaches on environments of increasing complexity, showing a significant improvement in sample efficiency and greater robustness in finding optimal policies.
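As a rough illustration of the factorization described in the abstract (a minimal sketch, not the authors' implementation; class and function names such as RewardMachine and FactoredModel are our own assumptions), the idea is that environment dynamics P(s'|s,a) are Markovian and can be estimated once, while task progress is tracked by a separate reward machine whose state advances on event labels:

```python
# Sketch only: a counts-based Markovian transition model kept separate from a
# reward machine that tracks non-Markovian task progress. Planning would then
# be done over the product of environment state s and machine state u.
from collections import defaultdict

class RewardMachine:
    """Finite automaton over event labels; reward depends on (u, label)."""
    def __init__(self, delta, reward, u0):
        self.delta = delta      # dict: (u, label) -> next machine state u'
        self.reward = reward    # dict: (u, label) -> reward
        self.u0 = u0            # initial machine state

    def step(self, u, label):
        u_next = self.delta.get((u, label), u)
        r = self.reward.get((u, label), 0.0)
        return u_next, r

class FactoredModel:
    """Counts-based estimate of P(s'|s,a), shared by all reward-machine states."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}

    def update(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1

    def p_hat(self, s, a):
        n = sum(self.counts[(s, a)].values())
        return {s2: c / n for s2, c in self.counts[(s, a)].items()} if n else {}
```

Planning (for instance, R-MAX-style optimistic value iteration) can then be run over product states (s, u); since the learned dynamics do not depend on u, every environment transition informs all reward-machine states at once, which is the intuition behind the sample-efficiency gain claimed above.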
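The continuous-state extension mentioned above relies on a SimHash-style bucketing of states. The sketch below is an illustrative assumption of how such a discretiser could look (parameter names and the exact hashing scheme are ours, not the paper's specification): nearby states tend to share the same sign pattern under random projections, so they fall into the same bucket and a tabular learner can operate on bucket ids directly.

```python
# Illustrative SimHash-style bucketiser for continuous states (sketch only).
import numpy as np

class SimHashBuckets:
    def __init__(self, state_dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        # Each row of A defines one random hyperplane, i.e. one hash bit.
        self.A = rng.standard_normal((n_bits, state_dim))

    def bucket(self, s):
        """Map a continuous state to a discrete bucket id via sign bits."""
        bits = (self.A @ np.asarray(s, dtype=float)) > 0.0
        return int(np.packbits(bits, bitorder="big").tobytes().hex(), 16)

# Usage: run a tabular algorithm on the bucket ids returned here.
hasher = SimHashBuckets(state_dim=4, n_bits=16)
print(hasher.bucket([0.1, -0.3, 0.05, 0.7]))
```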