Experience-Efficient Model-Free Deep Reinforcement Learning Using Pre-Training
By: Ruoxing Yang
Potential Business Impact:
Teaches robots faster with less practice.
We introduce PPOPT (Proximal Policy Optimization using Pretraining), a novel, model-free deep-reinforcement-learning algorithm that leverages pretraining to achieve high training efficiency and stability on very small training samples in physics-based environments. Reinforcement learning agents typically rely on large numbers of environment interactions to learn a policy. However, frequent interactions with a (computer-simulated) environment may incur high computational costs, especially when the environment is complex. Our main innovation is a new policy neural network architecture consisting of a pretrained middle section sandwiched between two fully-connected networks. Pretraining this middle section on a different environment with similar physics helps the agent learn the target environment efficiently, because the pretrained section encodes a general understanding of physics characteristics that transfer between the two environments. We demonstrate that PPOPT outperforms the classic PPO baseline on small training samples, both in reward gained and in overall training stability. While PPOPT underperforms classic model-based methods such as Dyna-DDPG, its model-free nature allows it to train in significantly less time than its model-based counterparts. Finally, we present our implementation of PPOPT as open-source software, available at github.com/Davidrxyang/PPOPT.
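
To illustrate the "sandwich" policy architecture described in the abstract, the following is a minimal PyTorch sketch: a pretrained middle network is wrapped between an input head and an output head, each a small fully-connected network. The class name, layer sizes, and the choice to freeze the pretrained core are illustrative assumptions, not details taken from the PPOPT implementation.

# Minimal sketch of the sandwich policy architecture (assumed details;
# names, layer sizes, and the freezing choice are not from the paper).
import torch
import torch.nn as nn

class SandwichPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, pretrained_core: nn.Module,
                 core_in=64, core_out=64, freeze_core=True):
        super().__init__()
        # Input head: maps target-environment observations into the
        # representation space the pretrained core expects.
        self.input_head = nn.Sequential(nn.Linear(obs_dim, core_in), nn.Tanh())
        # Middle section: network pretrained on an environment with similar physics.
        self.core = pretrained_core
        if freeze_core:
            for p in self.core.parameters():
                p.requires_grad = False
        # Output head: maps shared physics features to target-environment actions.
        self.output_head = nn.Sequential(
            nn.Linear(core_out, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )

    def forward(self, obs):
        return self.output_head(self.core(self.input_head(obs)))

# Usage example: reuse a core pretrained elsewhere for a new target task.
pretrained_core = nn.Sequential(
    nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh()
)
policy = SandwichPolicy(obs_dim=17, act_dim=6, pretrained_core=pretrained_core)
action_logits = policy(torch.randn(1, 17))

In this sketch only the two fully-connected heads are trained on the target environment, which reflects the paper's claim that the pretrained section supplies transferable physics knowledge; whether PPOPT freezes or fine-tunes the middle section is not specified in the abstract.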
Similar Papers
Pretraining in Actor-Critic Reinforcement Learning for Robot Motion Control
Robotics
Teaches robots new skills faster and better.
PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
Machine Learning (CS)
Teaches computers to learn tasks faster and better.
A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning
Machine Learning (CS)
Makes AI art generators better and faster.