Efficient Algorithms for Mitigating Uncertainty and Risk in Reinforcement Learning
By: Xihong Su
Potential Business Impact:
Helps computers make smarter, safer choices.
This dissertation makes three main contributions. First, we identify a new connection between policy gradient and dynamic programming in multi-model Markov decision processes (MMDPs) and propose the Coordinate Ascent Dynamic Programming (CADP) algorithm, which computes a Markov policy maximizing the discounted return averaged over the uncertain models. CADP adjusts model weights iteratively to guarantee monotone policy improvement and convergence to a local maximum. Second, we establish necessary and sufficient conditions for the exponential ERM (entropic risk measure) Bellman operator to be a contraction and prove the existence of stationary deterministic optimal policies for the ERM total reward criterion (ERM-TRC) and its entropic value-at-risk counterpart (EVaR-TRC). We also propose exponential value iteration, policy iteration, and linear programming algorithms for computing optimal stationary policies under these objectives. Third, we propose model-free Q-learning algorithms for the risk-averse ERM-TRC and EVaR-TRC objectives. The challenge is that the ERM Bellman operator underlying Q-learning may not be a contraction; instead, we exploit the monotonicity of the ERM Bellman operators to rigorously prove that the ERM-TRC and EVaR-TRC Q-learning algorithms converge to the optimal risk-averse value functions. The proposed Q-learning algorithms thus compute optimal stationary policies for ERM-TRC and EVaR-TRC.
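To make the first contribution concrete, below is a minimal Python sketch of the coordinate-ascent idea behind CADP, assuming a small tabular MMDP given by per-model transition arrays P[m], reward arrays R[m], and model weights lam. The function name cadp_sketch, the finite-horizon truncation, and the per-time-step coordinate updates are illustrative simplifications under those assumptions, not the dissertation's exact algorithm.

```python
import numpy as np

def cadp_sketch(P, R, lam, gamma, horizon, iters=50, seed=0):
    """
    Illustrative coordinate-ascent sketch for a multi-model MDP (MMDP).

    P[m]   : (S, A, S) transition probabilities of model m  (assumed inputs)
    R[m]   : (S, A) expected rewards of model m
    lam[m] : weight of model m
    Returns a time-dependent (Markov) deterministic policy pi[t, s] -> action.
    """
    lam = np.asarray(lam, dtype=float)
    M, S, A = len(P), P[0].shape[0], P[0].shape[1]
    rng = np.random.default_rng(seed)
    pi = rng.integers(A, size=(horizon, S))          # initial Markov policy
    mu0 = np.full(S, 1.0 / S)                        # initial state distribution

    def policy_values(pi):
        # v[m, t, s]: value-to-go of pi in model m from time t
        v = np.zeros((M, horizon + 1, S))
        for m in range(M):
            for t in reversed(range(horizon)):
                for s in range(S):
                    a = pi[t, s]
                    v[m, t, s] = R[m][s, a] + gamma * P[m][s, a] @ v[m, t + 1]
        return v

    def occupancies(pi):
        # d[m, t, s]: probability of being in s at time t in model m under pi
        d = np.zeros((M, horizon, S))
        for m in range(M):
            d[m, 0] = mu0
            for t in range(horizon - 1):
                for s in range(S):
                    d[m, t + 1] += d[m, t, s] * P[m][s, pi[t, s]]
        return d

    for _ in range(iters):
        v, d = policy_values(pi), occupancies(pi)
        for t in range(horizon):                     # coordinate = decision rule at time t
            # adjusted weights: model weight times state occupancy at time t
            b = lam[:, None] * d[:, t, :]
            q = np.zeros((S, A))
            for m in range(M):
                for a in range(A):
                    q[:, a] += b[m] * (R[m][:, a] + gamma * P[m][:, a, :] @ v[m, t + 1])
            pi[t] = q.argmax(axis=1)                 # greedy update of this coordinate
            v, d = policy_values(pi), occupancies(pi)  # refresh for the next coordinate
    return pi
```

The adjusted weights b[m, s], proportional to the model weight times the state occupancy in model m at time t, are what let a per-time-step greedy update act as a coordinate-ascent step on the weighted return.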
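The third contribution can similarly be illustrated with a simplified ERM-style Q-learning update that works in exponential-value space, where the sampled target is linear in the observed transition. The environment interface (reset, step, n_states, n_actions), the step-size schedule, and the exponential-space formulation are assumptions for illustration only; the dissertation's ERM-TRC and EVaR-TRC algorithms, whose convergence proofs rest on monotonicity rather than contraction, differ in detail, and the EVaR step is omitted here.

```python
import numpy as np

def erm_q_learning_sketch(env, alpha, episodes=2000, beta0=0.5, eps=0.1, seed=0):
    """
    Simplified ERM-style Q-learning sketch in exponential-value space.

    Tracks w(s, a) ~ E[exp(-alpha * total reward)] and recovers ERM Q-values
    as Q = -(1/alpha) * log(w); risk-averse for alpha > 0.  The environment
    interface below (reset() -> s, step(a) -> (s', r, done), n_states,
    n_actions) is a hypothetical tabular API assumed for this sketch.
    """
    rng = np.random.default_rng(seed)
    S, A = env.n_states, env.n_actions
    w = np.ones((S, A))                      # exp-value of the all-zero return
    visits = np.zeros((S, A))

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy; maximizing Q corresponds to minimizing w
            a = rng.integers(A) if rng.random() < eps else int(np.argmin(w[s]))
            s_next, r, done = env.step(a)
            visits[s, a] += 1
            beta = beta0 / (1.0 + visits[s, a])          # decaying step size
            target = np.exp(-alpha * r) * (1.0 if done else np.min(w[s_next]))
            w[s, a] = (1.0 - beta) * w[s, a] + beta * target
            s = s_next

    q = -np.log(np.maximum(w, 1e-300)) / alpha           # ERM Q-values
    policy = q.argmax(axis=1)                            # = argmin of w
    return q, policy
```

Working with w = E[exp(-alpha * return)] keeps the sampled target unbiased under a single observed transition, and the ERM Q-values and greedy stationary policy are recovered at the end from Q = -(1/alpha) log w.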
Similar Papers
Robust Bayesian Dynamic Programming for On-policy Risk-sensitive Reinforcement Learning
Risk Management
Teaches computers to make safer, smarter decisions.
Provably Efficient Sample Complexity for Robust CMDP
Machine Learning (CS)
Teaches robots to be safe and smart.
Efficient Policy Optimization in Robust Constrained MDPs with Iteration Complexity Guarantees
Machine Learning (CS)
Teaches robots to make safe choices always.