SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys

📅 2025-12-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Academic surveys generated by large language models (LLMs) currently lack systematic, multidimensional evaluation criteria. To address this gap, we propose SurveyEval, the first comprehensive benchmark specifically designed for evaluating academic survey generation. SurveyEval assesses outputs across three core dimensions: overall survey quality, outline coherence, and reference accuracy, spanning seven disciplinary domains. Methodologically, it combines retrieval-augmented generation (RAG), long-context evaluation techniques, and an enhanced LLM-as-a-Judge framework, augmented with human-annotated references to improve alignment between automated metrics and human judgment. Experimental results show that specialized survey-generation systems substantially outperform general-purpose long-text or academic-writing systems. SurveyEval exhibits strong discriminative power and scalability, offering a reliable, open, and extensible evaluation platform for future research in academic survey generation.
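To make the "LLM-as-a-Judge augmented with human references" idea concrete, below is a minimal sketch of what such a scoring call could look like. The rubric wording, dimension names, model choice (gpt-4o), JSON output format, and the judge_survey helper are illustrative assumptions made here, not SurveyEval's actual prompts or pipeline.

```python
# Hedged sketch: a human-reference-anchored LLM-as-a-Judge call.
# Rubric text, model name, and output schema are illustrative assumptions,
# not the paper's actual evaluation prompts.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Score the candidate survey from 1-5 on each dimension: "
    "overall_quality, outline_coherence, reference_accuracy. "
    "Use the human-written survey as a quality anchor, not as the only valid answer. "
    "Return a JSON object with the three scores and a short justification."
)

def judge_survey(candidate: str, human_reference: str, model: str = "gpt-4o") -> dict:
    """Ask an LLM judge to score a generated survey against a human-written reference."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {
                "role": "user",
                "content": (
                    f"# Human-written reference survey\n{human_reference}\n\n"
                    f"# Candidate survey\n{candidate}"
                ),
            },
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Anchoring the judge on a human-written survey of the same topic is one straightforward way to tighten the alignment between automated scores and human judgment that the summary describes.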

๐Ÿ“ Abstract
LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems are able to deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.
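As a concrete illustration of the reference-accuracy dimension, the sketch below checks whether cited titles can be matched against a verified bibliography using fuzzy string similarity. The matching rule, the 0.9 threshold, and the reference_accuracy helper are assumptions made here for illustration; SurveyEval's actual metric may be defined differently.

```python
# Hedged sketch: approximating a "reference accuracy" score by matching cited
# titles against a verified bibliography with fuzzy string similarity.
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two paper titles."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def reference_accuracy(cited_titles: list[str],
                       verified_titles: list[str],
                       threshold: float = 0.9) -> float:
    """Fraction of cited titles that match some entry in the verified bibliography."""
    if not cited_titles:
        return 0.0
    hits = sum(
        1 for cited in cited_titles
        if any(title_similarity(cited, ok) >= threshold for ok in verified_titles)
    )
    return hits / len(cited_titles)

# Example: one of two citations resolves to a verified entry -> 0.5
print(reference_accuracy(
    ["Attention Is All You Need", "Some Fabricated Survey Title 2031"],
    ["Attention is all you need", "A survey on neural machine translation"],
))
```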
Problem

Research questions and friction points this paper is trying to address.

Evaluating complex LLM-based academic survey generation systems comprehensively.
Assessing survey quality, outline coherence, and reference accuracy across subjects.
Enhancing evaluation-human alignment using benchmarks and human references.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark for evaluating LLM-generated surveys.
Extends evaluation across 7 subjects with human references.
Augments the LLM-as-a-Judge framework for better evaluation-human alignment.
Jiahao Zhao
Institute of Automation, Chinese Academy of Sciences
Shuaixing Zhang
Beijing Wenge Technology Co., Ltd., Beijing, China
Nan Xu
Beijing Wenge Technology Co., Ltd., Beijing, China
Lei Wang
Beijing Wenge Technology Co., Ltd., Beijing, China