LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead
By: Junda He, Jieke Shi, Terry Yue Zhuo, and more
Potential Business Impact:
Lets computers check code that other computers wrote.
The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks like code generation, producing a massive volume of software artifacts. This surge has exposed a critical bottleneck: the lack of scalable, reliable methods to evaluate these outputs. Human evaluation is costly and time-consuming, while traditional automated metrics like BLEU fail to capture nuanced quality aspects. In response, the LLM-as-a-Judge paradigm (using LLMs for automated evaluation) has emerged. This approach leverages the advanced reasoning of LLMs, offering a path toward human-like nuance at automated scale. However, LLM-as-a-Judge research in SE is still in its early stages. This forward-looking SE 2030 paper aims to steer the community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts. We provide a literature review of existing SE studies, analyze their limitations, identify key research gaps, and outline a detailed roadmap. We envision these frameworks as reliable, robust, and scalable human surrogates capable of consistent, multi-faceted artifact evaluation by 2030. Our work aims to foster research and adoption of LLM-as-a-Judge frameworks, ultimately improving the scalability of software artifact evaluation.
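In its simplest form, the paradigm described above amounts to prompting a judge model with a task, a candidate artifact, and a scoring rubric, then parsing the structured verdict. The Python sketch below illustrates that pattern; the rubric dimensions, the JUDGE_PROMPT wording, and the call_llm wrapper are illustrative assumptions for this sketch, not an interface defined in the paper.

```python
import json
from typing import Callable

# Rubric-based judge prompt. The dimensions (correctness, readability,
# maintainability) are illustrative, not a rubric prescribed by the paper.
JUDGE_PROMPT = """You are a strict code reviewer. Rate the candidate solution
for the task below on correctness, readability, and maintainability,
each on a 1-5 scale. Reply with JSON only:
{{"correctness": int, "readability": int, "maintainability": int, "rationale": str}}

Task:
{task}

Candidate solution:
{candidate}
"""

def judge_artifact(task: str, candidate: str,
                   call_llm: Callable[[str], str]) -> dict:
    """Ask a judge model to score one generated artifact against the rubric.

    `call_llm` is any function that sends a prompt to an LLM and returns its
    text response (e.g., a thin wrapper around a chat-completion client).
    """
    raw = call_llm(JUDGE_PROMPT.format(task=task, candidate=candidate))
    return json.loads(raw)  # assumes the judge replies with well-formed JSON

# Example usage, given some generated_code and an LLM wrapper:
#   scores = judge_artifact("Implement binary search.", generated_code, call_llm)
#   print(scores["correctness"], scores["rationale"])
```

Real LLM-as-a-Judge frameworks add safeguards this sketch omits, such as retrying on malformed JSON, averaging over multiple judge samples, and calibrating scores against human ratings.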
Similar Papers
From Code to Courtroom: LLMs as the New Software Judges
Software Engineering
Lets computers check the quality of code written by other computers.
Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering
Software Engineering
Helps computers judge code quality the way people do.
An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks
Software Engineering
Checks computer code in a way that closely matches human judgment.