Stabilizing Temporal Difference Learning via Implicit Stochastic Recursion
By: Hwanwoo Kim, Panos Toulis, Eric Laber
Potential Business Impact:
Makes reinforcement learning more stable and easier to tune, without extra computation.
Temporal difference (TD) learning is a foundational algorithm in reinforcement learning (RL). For nearly forty years, TD learning has served as a workhorse for applied RL as well as a building block for more complex and specialized algorithms. However, despite its widespread use, TD procedures are generally sensitive to step size specification. A poor choice of step size can dramatically increase variance and slow convergence in both on-policy and off-policy evaluation tasks. In practice, researchers use trial and error to identify stable step sizes, but these approaches tend to be ad hoc and inefficient. As an alternative, we propose implicit TD algorithms that reformulate TD updates into fixed point equations. Such updates are more stable and less sensitive to step size without sacrificing computational efficiency. Moreover, we derive asymptotic convergence guarantees and finite-time error bounds for our proposed implicit TD algorithms, which include implicit TD(0), TD($\lambda$), and TD with gradient correction (TDC). Our results show that implicit TD algorithms are applicable to a much broader range of step sizes, and thus provide a robust and versatile framework for policy evaluation and value approximation in modern RL tasks. We demonstrate these benefits empirically through extensive numerical examples spanning both on-policy and off-policy tasks.
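To make the fixed-point idea concrete, here is a minimal sketch of an implicit TD(0) update with linear value approximation. The function name, the toy inputs, and the closed-form rewriting via a rank-one (Sherman–Morrison) identity are illustrative assumptions, not the authors' code; the sketch only shows how solving the fixed-point equation for the new parameter vector shrinks the effective step size by a factor of 1 / (1 + alpha * ||phi||^2), which keeps each update bounded even when alpha is large.

```python
import numpy as np

def implicit_td0_step(theta, phi, phi_next, reward, alpha, gamma):
    """One implicit TD(0) update with linear value approximation V(s) = phi(s) @ theta.

    The implicit update solves the fixed-point equation
        theta_new = theta + alpha * (reward + gamma * phi_next @ theta
                                     - phi @ theta_new) * phi
    for theta_new. Because the unknown enters only through a rank-one term,
    the solution is an ordinary TD(0) step with a data-adaptively shrunken
    step size (an assumption-labeled sketch, not the paper's exact algorithm).
    """
    td_error = reward + gamma * phi_next @ theta - phi @ theta
    effective_alpha = alpha / (1.0 + alpha * phi @ phi)  # shrinkage from the implicit solve
    return theta + effective_alpha * td_error * phi

# Illustrative call with arbitrary numbers (not a real environment).
theta = np.zeros(4)
phi, phi_next = np.array([1.0, 0.0, 1.0, 0.0]), np.array([0.0, 1.0, 0.0, 1.0])
theta = implicit_td0_step(theta, phi, phi_next, reward=1.0, alpha=5.0, gamma=0.9)
```

In the standard (explicit) update the same TD error is scaled by alpha alone, so a large step size can make the iterates blow up; the implicit form damps exactly those large steps while leaving small-step behavior essentially unchanged, which is one intuition for the broader range of admissible step sizes reported in the abstract.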
Similar Papers
Implicit Updates for Average-Reward Temporal Difference Learning
Machine Learning (Stat)
Makes computer learning more stable and efficient.
Accelerated Distributional Temporal Difference Learning with Linear Function Approximation
Machine Learning (Stat)
Learns how good choices are faster with less data.
Accelerating Multi-Task Temporal Difference Learning under Low-Rank Representation
Machine Learning (CS)
Teaches computers to learn tasks faster.