Reinforcement Learning for Machine Learning Engineering Agents
By: Sherry Yang, Joy He-Yueya, Percy Liang
Potential Business Impact:
Teaches computers to learn and get better at tasks.
Existing agents for solving tasks such as ML engineering rely on prompting powerful language models. As a result, these agents do not improve with more experience. In this paper, we show that agents backed by weaker models that improve via reinforcement learning (RL) can outperform agents backed by much larger but static models. We identify two major challenges with RL in this setting. First, actions can take a variable amount of time (e.g., executing code for different solutions), which leads to asynchronous policy gradient updates that favor faster but suboptimal solutions. To tackle variable-duration actions, we propose duration-aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. Second, using only test-split performance as a reward provides limited feedback: a program that is nearly correct is treated the same as one that fails entirely. To address this, we propose environment instrumentation to offer partial credit, distinguishing almost-correct programs from those that fail early (e.g., during data loading). Environment instrumentation uses a separate static language model to insert print statements into an existing program to log the agent's experimental progress, from which partial credit can be extracted as a reward signal for learning. Our experimental results on MLEBench suggest that a much smaller model (Qwen2.5-3B) trained with RL outperforms a much larger model (Claude-3.5-Sonnet) prompted with agent scaffolds, by an average of 22% across 12 Kaggle tasks.
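The duration-aware update can be illustrated with a minimal Python sketch. It assumes, as an illustration rather than the paper's exact formula, that each sample's policy-gradient coefficient is its advantage multiplied by its execution duration, normalized by the batch-mean duration so the overall step size stays comparable. `Sample`, `duration_aware_weights`, and `surrogate_loss` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One completed agent action from an asynchronous worker (hypothetical structure)."""
    advantage: float  # reward minus baseline for this action
    duration: float   # wall-clock seconds the action took to execute


def duration_aware_weights(samples: List[Sample]) -> List[float]:
    """Per-sample coefficients on grad log pi(a|s).

    Assumption: each term is scaled by duration / mean duration.
    Plain policy gradient would use the advantage alone.
    """
    mean_d = sum(s.duration for s in samples) / len(samples)
    return [s.advantage * (s.duration / mean_d) for s in samples]


def surrogate_loss(log_probs: List[float], samples: List[Sample]) -> float:
    """Scalar loss whose gradient is the duration-weighted policy gradient.

    In an autograd framework, log_probs would be tensors and the weights
    would be treated as constants.
    """
    weights = duration_aware_weights(samples)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(samples)
```

For example, if within one wall-clock window a 10-second action with advantage 0.1 finishes six times and a 60-second action with advantage 0.5 finishes once, plain policy gradient gives the fast action a combined weight of 0.6 versus 0.5 for the slow one. Under this weighting the totals become 0.35 versus 1.75, so the slower, higher-reward action dominates the update.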
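Partial credit from an instrumented program's log might be computed along the following lines. This is a minimal sketch under stated assumptions: the `[PROGRESS]` markers, the fixed `milestone_bonus` per milestone, and the rule of adding the test-split score when a valid submission exists are hypothetical choices for illustration, not the paper's exact reward.

```python
from typing import Optional

# Hypothetical progress markers that an instrumentation model might insert as
# print statements into the agent's program. Real markers are not specified here.
MILESTONES = [
    "[PROGRESS] data loaded",
    "[PROGRESS] features built",
    "[PROGRESS] model trained",
    "[PROGRESS] predictions written",
]


def milestones_reached(stdout: str) -> int:
    """Count how many milestones appear in stdout, in the expected order."""
    lines = iter(stdout.splitlines())
    reached = 0
    for marker in MILESTONES:
        # `any` consumes the shared iterator, so each search resumes after the
        # line where the previous marker was found.
        if any(marker in line for line in lines):
            reached += 1
        else:
            break
    return reached


def partial_credit(stdout: str, test_score: Optional[float] = None,
                   milestone_bonus: float = 0.1) -> float:
    """Bonus per milestone reached, plus the test score if a valid submission
    was produced (both rules are illustrative assumptions)."""
    reward = milestone_bonus * milestones_reached(stdout)
    if test_score is not None:
        reward += test_score
    return reward
```

Under these assumptions, a program that crashes right after loading data earns 0.1, one that trains a model but fails while writing predictions earns 0.3, and a complete run adds its test score on top of 0.4, so near-misses are no longer scored the same as early failures.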
Similar Papers
Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards
Computation and Language
Teaches computers to solve hard problems better.
Expert-Free Online Transfer Learning in Multi-Agent Reinforcement Learning
Artificial Intelligence
Teaches computers new skills faster by using old knowledge.
Towards General-Purpose Model-Free Reinforcement Learning
Machine Learning (CS)
Makes computers learn many tasks with one method.