Tail Distribution of Regret in Optimistic Reinforcement Learning
By: Sajad Khodadadian, Mehrdad Moharrami
Potential Business Impact:
Helps robots learn faster and make fewer mistakes.
We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. Focusing on a UCBVI-type algorithm, we characterize the tail distribution of the cumulative regret $R_K$ over $K$ episodes, rather than only its expectation or a single high-probability quantile. We analyze two natural exploration-bonus schedules: (i) a $K$-dependent scheme that explicitly incorporates the total number of episodes $K$, and (ii) a $K$-independent scheme that depends only on the current episode index. For both settings, we obtain an upper bound on $\Pr(R_K \ge x)$ that exhibits a distinctive two-regime structure: a sub-Gaussian tail starting from an instance-dependent scale $m_K$ up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected regret $\mathbb{E}[R_K]$. The proposed algorithm depends on a tuning parameter $\alpha$, which trades off the expected regret against the range over which the regret exhibits a sub-Gaussian tail. To the best of our knowledge, our results provide one of the first comprehensive tail-regret guarantees for a standard optimistic algorithm in episodic reinforcement learning.
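To make the setup concrete, below is a minimal sketch (not the paper's actual implementation) of UCBVI-style optimistic value iteration on a finite-horizon tabular MDP, illustrating the two exploration-bonus schedules the abstract contrasts: a $K$-dependent bonus that plugs in the total number of episodes $K$, and a $K$-independent bonus that uses only the current episode index $k$. The Hoeffding-style bonus form, the constants, and the helper names (`bonus`, `optimistic_values`, `run_ucbvi`) are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of UCBVI-style optimistic value iteration for a
# finite-horizon tabular MDP. Bonus forms, constants, and function names
# are illustrative assumptions, not the paper's exact algorithm.
import numpy as np


def bonus(counts, horizon, episode, total_episodes=None, delta=0.05, c=1.0):
    """Hoeffding-style exploration bonus b(s, a) = c * H * sqrt(log(.) / n(s, a)).

    If total_episodes is given, the log term uses K (K-dependent schedule);
    otherwise it uses the current episode index k (K-independent schedule).
    """
    n = np.maximum(counts, 1)
    if total_episodes is not None:
        log_term = np.log(total_episodes / delta)      # K-dependent
    else:
        log_term = np.log((episode + 1) / delta)       # K-independent
    return c * horizon * np.sqrt(log_term / n)


def optimistic_values(p_hat, r, counts, horizon, episode, total_episodes=None):
    """Backward induction with optimism: Q_h = min(r + bonus + P_hat V_{h+1}, H)."""
    b = bonus(counts, horizon, episode, total_episodes)
    V = np.zeros(p_hat.shape[0])
    Q = np.zeros((horizon,) + r.shape)
    for h in reversed(range(horizon)):
        Q[h] = np.minimum(r + b + p_hat @ V, horizon)  # clip keeps values bounded
        V = Q[h].max(axis=1)
    return Q


def run_ucbvi(P, R, horizon, K, k_dependent=True, seed=0):
    """Run K episodes on a tabular MDP (P, R); the learner sees only samples."""
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    counts = np.zeros((S, A))
    trans_counts = np.zeros((S, A, S))
    for k in range(K):
        p_hat = trans_counts / np.maximum(counts[..., None], 1)
        Q = optimistic_values(p_hat, R, counts, horizon, k,
                              total_episodes=K if k_dependent else None)
        s = 0
        for h in range(horizon):
            a = int(Q[h, s].argmax())                  # greedy w.r.t. optimistic Q
            s_next = rng.choice(S, p=P[s, a])
            counts[s, a] += 1
            trans_counts[s, a, s_next] += 1
            s = s_next
    return counts


if __name__ == "__main__":
    S, A, H, K = 5, 2, 4, 200
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(S), size=(S, A))   # true transitions, unknown to the learner
    R = rng.uniform(size=(S, A))                 # rewards in [0, 1]
    visits = run_ucbvi(P, R, H, K, k_dependent=True)
    print("state-action visit counts after", K, "episodes:\n", visits)
```

The bonus constant `c` in this sketch plays a role loosely analogous to the tuning parameter $\alpha$ mentioned in the abstract, scaling how aggressively the algorithm explores; that correspondence is an assumption for illustration, not the paper's definition.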
Similar Papers
Eventually LIL Regret: Almost Sure $\ln\ln T$ Regret for a sub-Gaussian Mixture on Unbounded Data
Machine Learning (CS)
Makes betting smarter, even with wild data.
Optimistic Reinforcement Learning with Quantile Objectives
Machine Learning (CS)
Teaches computers to make safer, smarter choices.
No-Regret Gaussian Process Optimization of Time-Varying Functions
Machine Learning (Stat)
Finds best answers even when things change.