Agent-as-a-Judge
By: Runyang You, Hongru Cai, Caiqi Zhang, and more
Potential Business Impact:
Makes AI judges smarter and more trustworthy.
LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.
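To make the abstract's plan-verify-judge cycle concrete, below is a minimal sketch of an agentic judge loop. It is not the paper's method: the `llm` callable, the `tools` registry, and all prompts are hypothetical assumptions, and the code only illustrates how planning, tool-augmented verification, and persistent memory could fit together.

```python
# Minimal sketch of an Agent-as-a-Judge loop (hypothetical API, illustrative only).
# `llm` and `tools` are assumed callables, not part of any surveyed system.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class JudgeMemory:
    """Persistent record of observations gathered across evaluation steps."""
    observations: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.observations.append(note)


def agentic_judge(
    task: str,
    candidate_output: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
) -> str:
    memory = JudgeMemory()

    # 1. Planning: ask the judge model to break the evaluation into checks.
    plan = llm(
        "List verifiable checks for judging this output.\n"
        f"Task: {task}\nOutput: {candidate_output}"
    )
    checks = [line.strip("- ").strip() for line in plan.splitlines() if line.strip()]

    # 2. Tool-augmented verification: ground each check in an observation.
    for check in checks:
        tool_name = llm(
            f"Which tool from {list(tools)} best verifies: {check}? "
            "Answer with the tool name only."
        ).strip()
        tool = tools.get(tool_name)
        evidence = tool(candidate_output) if tool else "no tool available"
        memory.add(f"check={check!r} tool={tool_name!r} evidence={evidence!r}")

    # 3. Judgment: aggregate the grounded observations into a final verdict.
    return llm(
        "Given these observations, give a verdict with justification:\n"
        + "\n".join(memory.observations)
    )
```

Multi-agent collaboration, as described in the survey, would generalize this single loop by having several such judges (e.g., per evaluation dimension) debate or vote before the final verdict.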
Similar Papers
When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs
Artificial Intelligence
AI judges check other AI's work for mistakes.
Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation
Computation and Language
AI judges give better feedback on tasks.
JudgeAgent: Dynamically Evaluate LLMs with Agent-as-Interviewer
Computation and Language
Tests AI more thoroughly by asking harder, adaptive questions.