Exploitation Is All You Need... for Exploration
By: Micah Rentschler, Jesse Roberts
Potential Business Impact:
Computers learn to explore new games by remembering.
Ensuring sufficient exploration is a central challenge when training meta-reinforcement learning (meta-RL) agents to solve novel environments. Conventional solutions to the exploration-exploitation dilemma inject explicit incentives such as randomization, uncertainty bonuses, or intrinsic rewards to encourage exploration. In this work, we hypothesize that an agent trained solely to maximize a greedy (exploitation-only) objective can nonetheless exhibit emergent exploratory behavior, provided three conditions are met: (1) Recurring Environmental Structure, where the environment features repeatable regularities that allow past experience to inform future choices; (2) Agent Memory, enabling the agent to retain and utilize historical interaction data; and (3) Long-Horizon Credit Assignment, where learning propagates returns over a time frame sufficient for the delayed benefits of exploration to inform current decisions. Through experiments in stochastic multi-armed bandits and temporally extended gridworlds, we observe that, when both structure and memory are present, a policy trained on a strictly greedy objective exhibits information-seeking exploratory behavior. We further demonstrate, through controlled ablations, that emergent exploration vanishes if either environmental structure or agent memory is absent (Conditions 1 & 2). Surprisingly, removing long-horizon credit assignment (Condition 3) does not always prevent emergent exploration, a result we attribute to the pseudo-Thompson Sampling effect. These findings suggest that, under the right prerequisites, exploration and exploitation need not be treated as orthogonal objectives but can emerge from a unified reward-maximization process.
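The abstract's three conditions map naturally onto a small experiment. Below is a minimal, hypothetical sketch (not the authors' code) of the kind of setup described: a GRU policy conditioned on its interaction history (agent memory) is trained by plain REINFORCE on undiscounted return-to-go (long-horizon credit assignment) in a two-armed Bernoulli bandit whose arm identities are reshuffled each episode (recurring structure). Note there is no entropy bonus, exploration schedule, or intrinsic reward; any information-seeking behavior would be emergent. All names, hyperparameters, and design choices here are illustrative assumptions.

```python
# Hypothetical sketch, not the paper's implementation: a memory-equipped
# agent trained with a purely greedy (reward-maximization) objective.
import torch
import torch.nn as nn
from torch.distributions import Categorical

N_ARMS, HIDDEN, EP_LEN = 2, 32, 50

class GreedyBanditPolicy(nn.Module):
    """GRU policy conditioned on (prev action, prev reward): the 'memory'."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(N_ARMS + 1, HIDDEN)  # one-hot action + reward
        self.head = nn.Linear(HIDDEN, N_ARMS)

    def forward(self, x, h):
        h = self.rnn(x, h)
        return self.head(h), h

def run_episode(policy):
    # Recurring structure: one arm pays 0.9, the other 0.1, but which is
    # which is resampled each episode, so memory of past pulls is useful.
    probs = torch.tensor([0.9, 0.1])[torch.randperm(N_ARMS)]
    h = torch.zeros(1, HIDDEN)
    x = torch.zeros(1, N_ARMS + 1)  # no previous action/reward yet
    log_probs, rewards = [], []
    for _ in range(EP_LEN):
        logits, h = policy(x, h)
        dist = Categorical(logits=logits)
        a = dist.sample()
        r = torch.bernoulli(probs[a])
        log_probs.append(dist.log_prob(a))
        rewards.append(r)
        x = torch.cat([nn.functional.one_hot(a, N_ARMS).float(),
                       r.view(1, 1)], dim=1)
    return torch.cat(log_probs), torch.stack(rewards).squeeze()

policy = GreedyBanditPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-3)
for step in range(2000):
    log_probs, rewards = run_episode(policy)
    # Long-horizon credit assignment: undiscounted return-to-go, so early
    # information-gathering pulls get credit for later exploitation.
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    loss = -(log_probs * returns).mean()  # plain REINFORCE; no entropy bonus
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this sketch, fixing the arm identities across episodes (removing Condition 1) or replacing the GRU with a memoryless feedforward network (removing Condition 2) would be expected to eliminate the incentive to gather information, mirroring the ablations the abstract reports.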
Similar Papers
Disentangling Exploration of Large Language Models by Optimal Exploitation
Machine Learning (CS)
Helps computers learn better by exploring new things.
Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
Machine Learning (CS)
Makes AI better at math by rethinking its training rewards.
Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL
Machine Learning (CS)
Explains how goal-seeking AI learns to explore without being told.