SEMDICE: Off-policy State Entropy Maximization via Stationary Distribution Correction Estimation
By: Jongmin Lee, Meiqi Sun, Pieter Abbeel
Potential Business Impact:
Teaches robots to learn new skills faster.
In unsupervised pre-training for reinforcement learning, the agent aims to learn a prior policy for downstream tasks without relying on task-specific reward functions. We focus on state entropy maximization (SEM), where the goal is to learn a policy that maximizes the entropy of the state stationary distribution. In this paper, we introduce SEMDICE, a principled off-policy algorithm that computes a single, stationary Markov state-entropy-maximizing policy from an arbitrary off-policy dataset by optimizing directly within the space of stationary distributions. Experimental results demonstrate that SEMDICE outperforms baseline algorithms in maximizing state entropy while achieving the best adaptation efficiency for downstream tasks among SEM-based unsupervised RL pre-training methods.
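As a rough sketch of the kind of objective being optimized (the notation below — occupancy d, discount \gamma, initial distribution p_0, transition kernel P — is a generic stationary-distribution formulation assumed for illustration, not taken verbatim from the paper): optimizing "directly within the space of stationary distributions" typically means maximizing the entropy of the state marginal of a valid occupancy measure subject to the Bellman flow constraints,

\begin{aligned}
\max_{d \ge 0}\quad & H(d_S) = -\sum_{s} d_S(s)\,\log d_S(s), \qquad d_S(s) = \sum_{a} d(s,a), \\
\text{s.t.}\quad & \sum_{a} d(s,a) = (1-\gamma)\,p_0(s) + \gamma \sum_{s',a'} P(s \mid s',a')\,d(s',a') \quad \forall s,
\end{aligned}

with the policy recovered from the optimal occupancy as \pi(a \mid s) = d(s,a) / \sum_{a'} d(s,a'). DICE-style methods estimate the correction ratio between this target occupancy and the off-policy data distribution, which is what makes purely off-policy training from an arbitrary dataset possible.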
Similar Papers
Semi-gradient DICE for Offline Constrained Reinforcement Learning
Machine Learning (CS)
Helps robots learn safely from past experiences.
Average-DICE: Stationary Distribution Correction by Regression
Machine Learning (CS)
Improves computer learning by fixing data errors.
Distributionally Robust Online Markov Game with Linear Function Approximation
Machine Learning (Stat)
Helps robots learn real-world tasks from practice.