SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys
By: Jiahao Zhao, Shuaixing Zhang, Nan Xu, and more
Potential Business Impact:
Tests how well computers write academic surveys.
LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent work focuses on developing new generation pipelines, evaluating such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys along three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human-written references to strengthen alignment between automatic evaluation and human judgment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed for understanding and improving automatic survey systems across diverse subjects and evaluation criteria.
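To make the LLM-as-a-Judge setup described in the abstract concrete, here is a minimal Python sketch of per-dimension scoring with an optional human-written reference supplied to the judge. The three dimension names come from the abstract; the prompt wording, the 1-to-5 scale, and the judge / score_survey interface are illustrative assumptions, not the paper's actual implementation.

    # Hypothetical sketch: score a generated survey on SurveyEval's three dimensions.
    # The prompt text, scale, and `judge` callable are assumptions for illustration.
    from typing import Callable, Dict, Optional

    DIMENSIONS = ("overall quality", "outline coherence", "reference accuracy")

    def score_survey(
        generated_survey: str,
        judge: Callable[[str], str],            # thin wrapper around an LLM chat API
        human_reference: Optional[str] = None,  # human-written survey to anchor the judge
        scale: int = 5,
    ) -> Dict[str, float]:
        """Ask an LLM judge for a 1..`scale` rating on each evaluation dimension."""
        scores: Dict[str, float] = {}
        for dim in DIMENSIONS:
            prompt = (
                f"Rate the following automatically generated survey for {dim} "
                f"on a scale of 1 to {scale}. Reply with a single number.\n\n"
            )
            if human_reference is not None:
                prompt += f"Human-written reference survey for comparison:\n{human_reference}\n\n"
            prompt += f"Generated survey:\n{generated_survey}\n"
            scores[dim] = float(judge(prompt).strip())
        return scores

A caller would supply judge as a small wrapper around whichever LLM API they use, keeping the scoring sketch model-agnostic.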
Similar Papers
SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?
Computation and Language
Tests if AI can write good research summaries.
InteractiveSurvey: An LLM-based Personalized and Interactive Survey Paper Generation System
Information Retrieval
Writes survey papers faster, with your help.
Benchmarking Computer Science Survey Generation
Computation and Language
Helps computers write summaries of science papers.