Improving LLM-Generated Code Quality with GRPO
By: Maxime Robeyns, Laurence Aitchison
Potential Business Impact:
Makes computer code better and safer to use.
Large Language Models (LLMs) are gaining widespread use for code generation. Recent training procedures use execution feedback as a reward signal, typically focusing on the functional correctness of the code as measured by unit-test pass rate. However, this signal fails to capture notions of maintainability, quality, and safety of the code produced. We address this under-explored area by developing a comprehensive library to quantify various aspects of code quality, and we use it as a reward in GRPO. We find that GRPO increases code quality according to this measure, which is confirmed by expert, blinded human annotators.
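To make the setup described in the abstract concrete, here is a minimal sketch of how a code-quality signal could be blended with unit-test pass rate into a scalar reward and normalised within a sampled group, as GRPO does. This is not the authors' library: the metrics, weights, and function names below are illustrative assumptions only.

```python
import ast
import statistics

def quality_score(source: str) -> float:
    """Toy code-quality proxy in [0, 1]: penalises deep nesting and rewards
    documented functions. A stand-in for a richer quality metric suite."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.0
    max_depth = 0
    def depth(node, d=0):
        nonlocal max_depth
        max_depth = max(max_depth, d)
        for child in ast.iter_child_nodes(node):
            nested = isinstance(child, (ast.If, ast.For, ast.While, ast.Try, ast.With))
            depth(child, d + (1 if nested else 0))
    depth(tree)
    nesting = max(0.0, 1.0 - 0.2 * max(0, max_depth - 2))
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    documented = (sum(1 for f in funcs if ast.get_docstring(f)) / len(funcs)) if funcs else 1.0
    return 0.5 * nesting + 0.5 * documented

def blended_reward(source: str, pass_rate: float, quality_weight: float = 0.3) -> float:
    """Combine unit-test pass rate with the quality proxy (hypothetical weighting)."""
    return (1 - quality_weight) * pass_rate + quality_weight * quality_score(source)

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: reward minus the group mean, scaled by group std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```

In this sketch the blended scalar plays the role that unit-test pass rate alone plays in correctness-only training; GRPO then computes advantages relative to the other completions sampled for the same prompt.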
Similar Papers
From Reasoning to Code: GRPO Optimization for Underrepresented Languages
Machine Learning (CS)
Teaches computers to write code for rare languages.
Advancing Speech Understanding in Speech-Aware Language Models with GRPO
Computation and Language
Teaches computers to understand spoken words better.
Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization
Cryptography and Security
Finds computer bugs better by teaching AI.