Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora

📅 2025-05-13
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Human-curated factualness benchmarks for large language models (LLMs) are costly, narrow in domain coverage, and difficult to scale. Method: This paper proposes an automatic, document-driven synthetic evaluation framework that generates factual question-answer pairs end-to-end from domain-specific text (e.g., textbooks). It introduces a closed-loop paradigm that combines LM self-evaluation with unsupervised document input, integrating self-supervised prompt generation, retrieval-augmented structured question synthesis, and dual-format (multiple-choice and open-ended) QA construction. Contribution/Results: Evaluated on a recent arXiv preprint, the method agrees closely with human-curated benchmarks (Spearman ranking correlation of 0.96; Pearson accuracy correlation of 0.79) and reveals unexpectedly strong performance from Gemma3 models. By eliminating manual curation, the approach substantially lowers the barrier to building high-quality, domain-adaptable evaluation benchmarks and enables rapid cross-domain deployment.
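The pipeline's question-synthesis step prompts an LM with grounding text. The paper's actual prompts are not reproduced here; a hypothetical prompt builder for the multiple-choice format, to illustrate the grounded-generation idea, might look like:

```python
def build_mc_prompt(doc_chunk: str, n_questions: int = 3) -> str:
    """Assemble a grounded question-generation prompt.

    Hypothetical template for illustration only -- not the paper's
    actual prompt. The LM is asked to write questions answerable
    solely from the supplied source text, keeping the benchmark
    grounded in the document rather than in parametric knowledge.
    """
    return (
        "You are writing a factual exam from the source text below.\n"
        f"Write {n_questions} multiple-choice questions, each with "
        "options A-D and exactly one correct answer, answerable ONLY "
        "from the text.\n"
        "Label the correct option on a final 'Answer:' line.\n\n"
        f"Source text:\n{doc_chunk}\n"
    )
```

An open-ended variant would use the same grounding chunk but request free-form answers instead of labeled options.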

πŸ“ Abstract
Language Models (LMs) continue to advance, improving response quality and coherence. Given Internet-scale training datasets, LMs have likely encountered much of what users might ask them to generate in some form during their training. A plethora of evaluation benchmarks have been constructed to assess model quality, response appropriateness, and reasoning capabilities. However, the human effort required for benchmark construction is limited and being rapidly outpaced by the size and scope of the models under evaluation. Additionally, having humans build a benchmark for every possible domain of interest is impractical. Therefore, we propose a methodology for automating the construction of fact-based synthetic data model evaluations grounded in document populations. This work leverages those very same LMs to evaluate domain-specific knowledge automatically, using only grounding documents (e.g., a textbook) as input. This synthetic data benchmarking approach corresponds well with human curated questions with a Spearman ranking correlation of 0.96 and a benchmark evaluation Pearson accuracy correlation of 0.79. This novel tool supports generating both multiple choice and open-ended synthetic data questions to gain diagnostic insight of LM capability. We apply this methodology to evaluate model performance on a recent relevant arXiv preprint, discovering a surprisingly strong performance from Gemma3 models.
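The reported agreement with human-curated questions rests on two statistics: a Spearman ranking correlation of 0.96 and a Pearson accuracy correlation of 0.79. A minimal pure-Python sketch of both (with illustrative made-up scores, not the paper's data):

```python
def _rank(xs):
    # Assign 1-based ranks, averaging ranks across ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    # Standard sample Pearson correlation coefficient.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman = Pearson computed on the ranks of the data.
    return pearson(_rank(x), _rank(y))

# Illustrative per-model accuracies (synthetic benchmark vs. human benchmark):
synthetic = [0.62, 0.71, 0.55, 0.80]
human = [0.60, 0.74, 0.50, 0.78]
```

In the paper's setting, `spearman` would compare model *rankings* under the synthetic and human benchmarks, while `pearson` would compare the raw accuracy scores themselves.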
Problem

Research questions and friction points this paper is trying to address.

Automating fact-based synthetic data evaluations for LMs
Reducing human effort in benchmark construction for LMs
Evaluating domain-specific knowledge using grounding documents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates synthetic data evaluations using document corpora
Leverages LMs to assess domain-specific knowledge automatically
Generates multiple choice and open-ended diagnostic questions
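For the multiple-choice format, generated items can be scored by exact match on the answer letter and models ranked by accuracy. A minimal sketch with hypothetical data structures (not the paper's code):

```python
def accuracy(predictions, answer_key):
    """Fraction of questions where the predicted letter matches the key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

def rank_models(model_preds, answer_key):
    """Return (model_name, accuracy) pairs sorted best-first.

    model_preds: dict mapping model name -> list of predicted letters.
    """
    scores = {m: accuracy(p, answer_key) for m, p in model_preds.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Open-ended answers would instead need a graded judgment (e.g., an LM-as-judge step), which is why the dual-format design yields complementary diagnostic signals.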