LongQAEval: Designing Reliable Evaluations of Long-Form Clinical QA under Resource Constraints
By: Federica Bologna, Tiffany Pan, Matthew Wilkens, and more
Potential Business Impact:
Tests AI answers to patient questions faster and more cheaply.
Evaluating long-form clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise, and achieving consistent human judgments over long-form text is difficult. We introduce LongQAEval, an evaluation framework and set of recommendations for limited-resource, high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation along the dimensions of correctness, relevance, and safety. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse annotation improves agreement on relevance, and judgments of safety remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotation, reducing cost and effort.
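The comparison of coarse answer-level and fine-grained sentence-level annotation can be illustrated with a small agreement calculation. The sketch below is not the paper's protocol: it assumes two annotators, binary correctness labels, Cohen's kappa as the IAA statistic, and synthetic stand-in data rather than the physician annotations described above.

# Minimal sketch: inter-annotator agreement (IAA) at answer level vs.
# sentence level, plus agreement on a random subset of sentences.
# Assumptions (not from the paper): binary labels, two annotators,
# Cohen's kappa, synthetic data.

import random
from sklearn.metrics import cohen_kappa_score

random.seed(0)

# Coarse: two annotators label each of 300 answers for correctness.
n_answers = 300
answer_a = [random.randint(0, 1) for _ in range(n_answers)]
answer_b = [a if random.random() < 0.8 else 1 - a for a in answer_a]

# Fine-grained: the same annotators label every sentence of each answer.
sent_a, sent_b = [], []
for _ in range(n_answers):
    n_sents = random.randint(5, 15)  # long-form answers
    labels_a = [random.randint(0, 1) for _ in range(n_sents)]
    labels_b = [l if random.random() < 0.85 else 1 - l for l in labels_a]
    sent_a.extend(labels_a)
    sent_b.extend(labels_b)

print("answer-level kappa:   ", cohen_kappa_score(answer_a, answer_b))
print("sentence-level kappa: ", cohen_kappa_score(sent_a, sent_b))

# Annotating only a subset of sentences (here 20%) to cut cost,
# then checking whether agreement remains comparable.
idx = random.sample(range(len(sent_a)), k=len(sent_a) // 5)
sub_a = [sent_a[i] for i in idx]
sub_b = [sent_b[i] for i in idx]
print("sentence-subset kappa:", cohen_kappa_score(sub_a, sub_b))

In practice, the same comparison would be run per dimension (correctness, relevance, safety) on the real physician labels, with the agreement statistic chosen to match the annotation scale.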
Similar Papers
LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA
Computation and Language
Helps computers understand stories better.
An Empirical Study of Evaluating Long-form Question Answering
Information Retrieval
Makes computers write better, longer answers.
ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics
Computation and Language
Tests AI answers across many science topics.