🤖 AI Summary
Existing automatic survey generation (ASG) evaluation methods suffer from metric bias, neglect of human preferences, and over-reliance on large language model (LLM)-based scoring. To address these limitations, we propose SGSimEval, the first multidimensional benchmark to integrate outline quality, content coverage, and reference relevance. Our key contribution is a human-preference-calibrated, similarity-enhanced evaluation framework that tightly couples LLM scoring, quantitative metrics, and human annotation, establishing a hybrid qualitative–quantitative assessment paradigm. Experimental results show that current ASG systems achieve near-human performance in outline generation but remain substantially deficient in content depth and bibliographic grounding. Crucially, our proposed metrics exhibit strong agreement with human judgments (Spearman's ρ > 0.85), significantly improving evaluation reliability and interpretability.
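To make the similarity-enhanced idea concrete, here is a minimal sketch (not the paper's actual implementation) of blending a normalized LLM-judge score with an embedding-similarity score against a human-written reference survey. The `hybrid_score` function, the equal weighting, and the toy embeddings are illustrative assumptions, not the paper's calibrated values.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_score(llm_score: float, sim_to_human: float, weight: float = 0.5) -> float:
    """Blend a normalized LLM-judge score (0-1) with a similarity-to-human
    score (0-1). The 0.5 weight is an illustrative assumption."""
    return weight * llm_score + (1.0 - weight) * sim_to_human

# Toy embeddings for a generated outline vs. a human-written one.
gen_emb = np.array([0.2, 0.7, 0.1])
human_emb = np.array([0.25, 0.65, 0.15])

sim = cosine_similarity(gen_emb, human_emb)
print(f"hybrid score: {hybrid_score(llm_score=0.8, sim_to_human=sim):.3f}")
```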
📝 Abstract
Recent advances in large language models (LLMs) have spurred growing interest in automatic survey generation (ASG), a task that traditionally demanded considerable time and effort. With progress in retrieval-augmented generation (RAG) and the rising popularity of multi-agent systems (MASs), synthesizing academic surveys with LLMs has become viable, elevating the need for robust evaluation methods in this domain. However, existing evaluation methods suffer from several limitations, including biased metrics, a lack of human preference alignment, and an over-reliance on LLMs-as-judges. To address these challenges, we propose SGSimEval, a comprehensive benchmark for Survey Generation with Similarity-Enhanced Evaluation. SGSimEval assesses ASG systems along three dimensions (outline, content, and references) and combines LLM-based scoring with quantitative metrics to provide a multifaceted evaluation framework. We also introduce human preference metrics that emphasize both inherent quality and similarity to human-written surveys. Extensive experiments reveal that current ASG systems achieve human-comparable quality in outline generation while leaving significant room for improvement in content and reference generation, and that our evaluation metrics maintain strong consistency with human assessments.
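The reported consistency with human assessments corresponds to a standard rank-correlation check. Below is a minimal sketch using SciPy's `spearmanr`; the per-survey metric scores and 1–5 human ratings are made-up toy data, not results from the paper.

```python
from scipy.stats import spearmanr

# Hypothetical per-survey scores: automatic metric outputs vs. human ratings (1-5).
metric_scores = [0.72, 0.55, 0.91, 0.63, 0.80, 0.47]
human_ratings = [4, 3, 5, 3, 4, 2]

# Spearman's rho measures monotonic agreement between the two rankings.
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```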