ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics
By: Li S. Yifei, Allen Chang, Chaitanya Malaviya, and more
Potential Business Impact:
Tests AI answers across many science topics.
Evaluating long-form responses to research queries relies heavily on expert annotators, which restricts attention to areas like AI where researchers can conveniently enlist colleagues. Yet research expertise is widespread: survey articles synthesize knowledge distributed across the literature. We introduce ResearchQA, a resource for evaluating LLM systems by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Each rubric, derived jointly with queries from survey sections, lists query-specific answer evaluation criteria, such as citing papers, making explanations, and describing limitations. Assessments by 31 Ph.D. annotators in 8 fields indicate that 96% of queries support Ph.D. information needs and 87% of rubric items should be addressed in system responses by a sentence or more. Using our rubrics, we construct an automatic pairwise judge that obtains 74% agreement with expert judgments. We leverage ResearchQA to analyze competency gaps in 18 systems across more than 7.6K pairwise evaluations. No parametric or retrieval-augmented system we evaluate exceeds 70% on covering rubric items, and the highest-ranking agentic system shows 75% coverage. Error analysis reveals that the highest-ranking system fully addresses fewer than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.
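To make the quantities in the abstract concrete, here is a minimal Python sketch of how rubric coverage, a per-category "fully addressed" rate, and judge-expert agreement might be computed. It is not the authors' implementation. The three-level grading scale (0, 1, 2), the coverage formula, the tie margin, and all names (`RubricScore`, `coverage`, `fully_addressed_rate`, `pairwise_preference`, `agreement`) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RubricScore:
    """One graded rubric item (hypothetical schema, not from the paper)."""
    item: str    # text of the rubric item
    kind: str    # e.g. "citation", "limitation", "comparison"
    level: int   # assumed scale: 0 = not addressed, 1 = partially, 2 = fully


def coverage(scores: List[RubricScore]) -> float:
    """Fraction of the maximum rubric credit earned (assumed definition)."""
    if not scores:
        raise ValueError("coverage is undefined for an empty rubric")
    return sum(s.level for s in scores) / (2 * len(scores))


def fully_addressed_rate(scores: List[RubricScore], kind: str) -> Optional[float]:
    """Share of items of one kind graded as fully addressed, or None if there are none."""
    subset = [s for s in scores if s.kind == kind]
    if not subset:
        return None
    return sum(s.level == 2 for s in subset) / len(subset)


def pairwise_preference(a: List[RubricScore], b: List[RubricScore],
                        margin: float = 0.0) -> str:
    """Prefer the response with higher coverage; 'tie' within the margin."""
    diff = coverage(a) - coverage(b)
    if abs(diff) <= margin:
        return "tie"
    return "A" if diff > 0 else "B"


def agreement(judge: List[str], expert: List[str]) -> float:
    """Fraction of pairwise comparisons where judge and expert labels match."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == e for j, e in zip(judge, expert)) / len(judge)


# Toy data, invented for illustration only.
response_a = [RubricScore("cites a key paper", "citation", 2),
              RubricScore("describes a limitation", "limitation", 1)]
response_b = [RubricScore("cites a key paper", "citation", 0),
              RubricScore("describes a limitation", "limitation", 2)]

print(coverage(response_a))                          # 0.75
print(pairwise_preference(response_a, response_b))   # A
```

Under these assumptions, a figure such as "74% agreement" would correspond to `agreement(...)` returning 0.74 over the set of expert-labelled pairs, and "75% coverage" to an average of `coverage(...)` over queries.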
Similar Papers
SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?
Computation and Language
Tests if AI can write good research summaries.
LONGQAEVAL: Designing Reliable Evaluations of Long-Form Clinical QA under Resource Constraints
Computation and Language
Tests doctor AI answers faster, cheaper.
LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA
Computation and Language
Helps computers understand stories better.