JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer
By: Zhichao Shi, Xuhui Jiang, Chengjin Xu and more
Potential Business Impact:
Tests AI better to make it smarter.
Current evaluation paradigms for large language models (LLMs) suffer from overestimated or biased evaluation and mismatched question difficulty, leading to incomplete evaluations of LLMs' knowledge and capability boundaries, which hinders the effective application and optimization of LLMs. To address these challenges, we propose Agent-as-Interviewer, a dynamic evaluation paradigm that employs LLM agents to conduct multi-turn interactions for evaluation. Unlike current benchmarking or dynamic interaction paradigms, Agent-as-Interviewer utilizes agents to call knowledge tools for wider and deeper knowledge during dynamic multi-turn question generation, achieving more complete evaluations of the target LLM's knowledge boundaries. It also leverages agents to plan query strategies that adjust question difficulty levels, improving difficulty control so that questions match the actual capabilities of target LLMs. Based on this paradigm, we develop JudgeAgent, a knowledge-wise dynamic evaluation framework that employs knowledge-driven synthesis as the agent's tool and uses difficulty scoring as strategy guidance, ultimately providing valuable suggestions that help target models optimize themselves. Extensive experiments validate the effectiveness of JudgeAgent's suggestions, demonstrating that Agent-as-Interviewer can accurately identify the knowledge and capability boundaries of target models. The source code is available at https://anonymous.4open.science/r/JudgeAgent.
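To make the paradigm concrete, here is a minimal Python sketch of an interviewer-style evaluation loop as the abstract describes it: an agent calls a knowledge tool, synthesizes a question at a planned difficulty level, scores the target model's answer, and adjusts difficulty for the next turn. All class and function names (InterviewerAgent, retrieve_knowledge, synthesize_question, grade_answer) are hypothetical illustrations, not the paper's actual JudgeAgent API; see the linked repository for the real implementation.

```python
# Hypothetical sketch of an Agent-as-Interviewer loop; names and thresholds
# are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class InterviewTurn:
    question: str
    answer: str
    score: float          # graded correctness in [0, 1]
    difficulty: int       # difficulty level used for this turn

@dataclass
class InterviewerAgent:
    retrieve_knowledge: Callable[[str], str]        # knowledge tool (e.g. retrieval over a KG or corpus)
    synthesize_question: Callable[[str, int], str]  # knowledge-driven question synthesis at a difficulty level
    grade_answer: Callable[[str, str], float]       # difficulty-aware scoring of the target's answer
    history: List[InterviewTurn] = field(default_factory=list)

    def next_difficulty(self) -> int:
        """Plan the query strategy: raise difficulty after a strong answer, lower it after a weak one."""
        if not self.history:
            return 1
        last = self.history[-1]
        return last.difficulty + 1 if last.score >= 0.7 else max(1, last.difficulty - 1)

    def interview(self, target_llm: Callable[[str], str], topic: str, turns: int = 3) -> List[InterviewTurn]:
        """Run a multi-turn interview against the target model and record each turn."""
        for _ in range(turns):
            difficulty = self.next_difficulty()
            context = self.retrieve_knowledge(topic)              # widen/deepen knowledge via the tool
            question = self.synthesize_question(context, difficulty)
            answer = target_llm(question)
            score = self.grade_answer(question, answer)
            self.history.append(InterviewTurn(question, answer, score, difficulty))
        return self.history
```

The accumulated history of turns (questions, scores, and difficulty levels) is what a framework like JudgeAgent could then summarize into optimization suggestions for the target model.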
Similar Papers
JudgeAgent: Dynamically Evaluate LLMs with Agent-as-Interviewer
Computation and Language
Tests AI better by asking harder, changing questions.
Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications
Computation and Language
Helps computers judge writing better than people.
When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs
Artificial Intelligence
AI judges check other AI's work for mistakes.