SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automatic survey generation (ASG) evaluation methods suffer from metric bias, neglect of human preferences, and overreliance on large language model (LLM)-based scoring. To address these limitations, we propose SGSimEval—the first multidimensional benchmark integrating outline quality, content coverage, and reference relevance. Our key contribution is a human-preference-calibrated, similarity-enhanced evaluation framework that tightly couples LLM scoring, quantitative metrics, and human annotation, thereby establishing a hybrid qualitative–quantitative assessment paradigm. Experimental results show that current ASG systems achieve near-human performance in outline generation but remain substantially deficient in content depth and bibliographic grounding. Crucially, our proposed metrics exhibit strong agreement with human judgments (Spearman’s ρ > 0.85), significantly improving evaluation reliability and interpretability.
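
The reported metric-human agreement (Spearman's ρ > 0.85) is a rank correlation. As a minimal sketch of how such agreement is typically checked, assuming one metric score and one human rating per generated survey (the numbers below are invented for illustration, not taken from the paper):

```python
# Sketch: rank agreement between an automatic metric and human judgments.
# Scores are invented for illustration; only the procedure is of interest.
from scipy.stats import spearmanr

metric_scores = [3.2, 4.1, 2.8, 4.6, 3.9, 2.5]  # one metric score per survey
human_scores = [3.0, 4.3, 2.6, 4.8, 3.7, 2.9]   # annotator ratings, same order

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3g})")
```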

📝 Abstract
The growing interest in automatic survey generation (ASG), a task that traditionally required considerable time and effort, has been spurred by recent advances in large language models (LLMs). With advances in retrieval-augmented generation (RAG) and the rising popularity of multi-agent systems (MASs), synthesizing academic surveys with LLMs has become viable, heightening the need for robust evaluation methods in this domain. However, existing evaluation methods suffer from several limitations, including biased metrics, neglect of human preferences, and an over-reliance on LLMs-as-judges. To address these challenges, we propose SGSimEval, a comprehensive benchmark for Survey Generation with Similarity-Enhanced Evaluation. It assesses automatic survey generation systems across the outline, content, and references, and combines LLM-based scoring with quantitative metrics to form a multifaceted evaluation framework. SGSimEval also introduces human preference metrics that emphasize both inherent quality and similarity to human-written surveys. Extensive experiments reveal that current ASG systems achieve human-comparable performance in outline generation while leaving significant room for improvement in content and reference generation, and that our evaluation metrics maintain strong consistency with human assessments.
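
One way to read "similarity-enhanced" is as a weighted blend of an LLM-assigned quality score and an embedding similarity between the generated survey and a human-written one on the same topic. The sketch below illustrates that idea under stated assumptions; the embedding model, the `alpha` weight, and the normalization are illustrative choices, not the paper's implementation:

```python
# Sketch of a similarity-enhanced score: blend a normalized LLM quality
# rating with embedding similarity to a human-written reference survey.
# The model and weights are illustrative choices, not SGSimEval's.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_enhanced_score(llm_score: float, generated: str,
                              human_reference: str, alpha: float = 0.5) -> float:
    """Combine inherent quality (LLM score on a 1-5 scale) with similarity to humans."""
    emb = model.encode([generated, human_reference], convert_to_tensor=True)
    sim = util.cos_sim(emb[0], emb[1]).item()  # cosine similarity in [-1, 1]
    quality = (llm_score - 1) / 4              # normalize 1-5 scale to [0, 1]
    similarity = (sim + 1) / 2                 # normalize similarity to [0, 1]
    return alpha * quality + (1 - alpha) * similarity

print(similarity_enhanced_score(4.0, "Generated survey text ...",
                                "Human-written survey text ..."))
```
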
Problem

Research questions and friction points this paper is trying to address.

How to evaluate automatic survey generation (ASG) systems comprehensively
Existing metrics are biased, neglect human preferences, and over-rely on LLMs-as-judges
Current ASG systems fall short in content depth and reference grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

SGSimEval integrates outline, content, and reference assessments
Combines LLM-based scoring with quantitative metrics (see the sketch after this list)
Introduces human preference metrics emphasizing both inherent quality and similarity to human-written surveys
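
As a concrete example of a quantitative reference metric, one could measure overlap between the generated survey's bibliography and that of a human-written survey on the same topic. The title-normalization heuristic below is a hypothetical illustration, not SGSimEval's actual metric:

```python
# Sketch: reference overlap between generated and human bibliographies,
# matched on normalized titles. A simple heuristic, not SGSimEval's metric.
import re

def normalize(title: str) -> str:
    """Lowercase and strip punctuation so near-identical titles match."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def reference_overlap(generated_refs: list[str], human_refs: list[str]) -> float:
    """Fraction of human-survey references also cited by the generated survey."""
    generated = {normalize(t) for t in generated_refs}
    human = {normalize(t) for t in human_refs}
    return len(generated & human) / len(human) if human else 0.0

print(reference_overlap(
    ["Attention Is All You Need", "BERT: Pre-training of Deep Bidirectional Transformers"],
    ["Attention is all you need.", "GPT-4 Technical Report"],
))  # -> 0.5
```
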
Authors

Beichen Guo, The Hong Kong Polytechnic University, Hong Kong, China
Zhiyuan Wen, The Hong Kong Polytechnic University (NLP)
Yu Yang, The Education University of Hong Kong, Hong Kong, China
Peng Gao, The Hong Kong Polytechnic University, Hong Kong, China
Ruosong Yang, The Hong Kong Polytechnic University (NLP)
Jiaxing Shen, Lingnan University, Hong Kong, China