EvalAgent: Discovering Implicit Evaluation Criteria from the Web
By: Manya Wadhwa, Zayne Sprague, Chaitanya Malaviya and more
Potential Business Impact:
Helps AI write better, more helpful answers.
Evaluation of language model outputs on structured writing tasks is typically conducted with a number of desirable criteria presented to human evaluators or large language models (LLMs). For instance, on a prompt like "Help me draft an academic talk on coffee intake vs research productivity", a model response may be evaluated on criteria like accuracy and coherence. However, high-quality responses should do more than just satisfy basic task requirements. An effective response to this query should include quintessential features of an academic talk, such as a compelling opening, clear research questions, and a takeaway. To help identify these implicit criteria, we introduce EvalAgent, a novel framework designed to automatically uncover nuanced and task-specific criteria. EvalAgent first mines expert-authored online guidance. It then uses this evidence to propose diverse, long-tail evaluation criteria that are grounded in reliable external sources. Our experiments demonstrate that the grounded criteria produced by EvalAgent are often implicit (not directly stated in the user's prompt), yet specific (high degree of lexical precision). Further, EvalAgent criteria are often not satisfied by initial responses, but they are actionable: responses can be refined to satisfy them. Finally, we show that combining LLM-generated and EvalAgent criteria uncovers more human-valued criteria than using LLMs alone.
Similar Papers
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Computation and Language
Helps computers judge writing better with less help.
AgentEval: Generative Agents as Reliable Proxies for Human Evaluation of AI-Generated Content
Artificial Intelligence
Helps computers check if AI writing is good.
JudgeAgent: Dynamically Evaluate LLMs with Agent-as-Interviewer
Computation and Language
Tests AI better by asking harder, changing questions.