Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
By: Alexander Golubev, Maria Trofimova, Sergei Polezhaev and more
Potential Business Impact:
Enables AI agents to solve complex coding problems.
Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent's success rate on the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.
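The abstract's key algorithmic ingredients, group-relative advantage estimation and the decoupled (asymmetric) clipping range from DAPO, can be sketched briefly. This is a minimal illustration, not the paper's implementation; the function names and the specific epsilon values are assumptions for the example, and the paper further modifies DAPO for long-context, multi-turn trajectories.

```python
import math

def group_advantages(rewards):
    """Group-relative advantage: normalize each trajectory's reward by the
    mean and std of its sampling group (as in GRPO/DAPO-style training)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0:
        return [0.0] * n  # all rollouts got the same reward: no signal
    return [(r - mean) / std for r in rewards]

def dapo_token_loss(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with decoupled clip bounds: a larger
    upper bound ('clip-higher') lets low-probability tokens gain mass."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return -min(ratio * advantage, clipped * advantage)
```

For example, a group of four rollouts with binary success rewards `[1, 0, 0, 1]` yields advantages `[1, -1, -1, 1]`: successful trajectories are reinforced relative to their own sampling group, with no teacher model or learned critic required.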
Similar Papers
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Machine Learning (CS)
Teaches AI to solve hard math problems better.
Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
Machine Learning (CS)
Makes smart computer programs learn better and faster.
Learning Robust Social Strategies with Large Language Models
Machine Learning (CS)
Teaches AI to work together, not cheat.