SynQuE: Estimating Synthetic Dataset Quality Without Annotations

📅 2025-11-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper introduces the Synthetic Dataset Quality Estimation (SynQuE) problem: predicting the downstream task performance of synthetic data given only a small set of unlabeled real-world samples. To address it, the authors establish the first SynQuE benchmark and proxy-metric suite, and propose LENS, a large language model (LLM)-guided reasoning framework that jointly models embedding similarity, distributional shift, and diversity, while incorporating LLM-based semantic alignment to detect subtle synthetic artifacts in complex tasks. Experiments show that LENS correlates strongly with ground-truth task performance (average Spearman ρ > 0.85). Applied to text-to-SQL generation, selecting synthetic data with high LENS scores improves accuracy by 8.1% on average. The work provides the first practical, scalable evaluation framework for synthetic data selection under data scarcity.

πŸ“ Abstract
We introduce and formalize the Synthetic Dataset Quality Estimation (SynQuE) problem: ranking synthetic datasets by their expected real-world task performance using only limited unannotated real data. This addresses a critical open challenge in settings where data is scarce due to collection costs or privacy constraints. We establish the first comprehensive benchmarks for this problem and introduce the first SynQuE proxy metrics, adapting distribution- and diversity-based distance measures to our context via embedding models, so that the synthetic data chosen for training maximizes task performance on real data. To address the shortcomings of these metrics on complex planning tasks, we propose LENS, a novel proxy that leverages large language model reasoning. Our results show that SynQuE proxies correlate with real task performance across diverse tasks, including sentiment analysis, Text2SQL, web navigation, and image classification, with LENS consistently outperforming the others on complex tasks by capturing nuanced dataset characteristics. For instance, on text-to-SQL parsing, training on the top-3 synthetic datasets selected via SynQuE proxies raises accuracy from 30.4% to 38.4% (+8.1 on average) compared to selecting data indiscriminately. This work establishes SynQuE as a practical framework for synthetic data selection under real-data scarcity and motivates future research on foundation-model-based data characterization and fine-grained data selection.
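The embedding-based proxies described in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual metrics: it scores a candidate synthetic dataset by how close its embedding centroid lies to the real data's centroid, plus a small diversity bonus. The `div_weight` parameter and the toy 2-D "embeddings" are invented for the example; a real pipeline would use vectors from an embedding model.

```python
import math

def mean_vec(vecs):
    """Component-wise mean of a list of equal-length vectors (the centroid)."""
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diversity(vecs):
    """Mean pairwise distance within a dataset (higher = more diverse)."""
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(dist(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)

def proxy_score(real, synth, div_weight=0.1):
    """Higher is better: close to the real centroid, internally diverse."""
    return -dist(mean_vec(real), mean_vec(synth)) + div_weight * diversity(synth)

# Toy 2-D embeddings: real data clustered near (1, 1).
real    = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]]
synth_a = [[1.0, 1.2], [0.8, 1.0], [1.2, 0.9]]   # matches the real distribution
synth_b = [[3.0, 3.0], [3.1, 2.9], [2.9, 3.1]]   # shifted away from real

ranked = sorted({"A": synth_a, "B": synth_b}.items(),
                key=lambda kv: proxy_score(real, kv[1]), reverse=True)
print([name for name, _ in ranked])  # → ['A', 'B']
```

Training data would then be drawn from the top-ranked datasets, mirroring the paper's "top-3 datasets by proxy score" selection protocol.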
Problem

Research questions and friction points this paper is trying to address.

Estimating synthetic dataset quality without annotation requirements
Ranking synthetic datasets using limited unannotated real data
Selecting optimal synthetic training data to maximize real-world task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Estimates synthetic dataset quality without annotations
Uses distribution and diversity-based proxy metrics
Introduces LENS proxy leveraging large language models
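The benchmark's headline measure, Spearman's ρ between proxy scores and real downstream performance, can be computed directly. A minimal sketch, assuming distinct, tie-free scores; the proxy values and accuracies below are invented for illustration:

```python
def rank(xs):
    """1-based ranks of xs, assuming all values are distinct (no tie handling)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for pos, i in enumerate(order):
        ranks[i] = pos + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation via the classic 1 - 6*sum(d^2)/(n(n^2-1)) formula."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical: proxy scores for four synthetic datasets vs. their true accuracies.
proxy = [0.9, 0.4, 0.7, 0.1]
acc   = [38.0, 35.0, 31.0, 29.0]
print(spearman(proxy, acc))  # → 0.8
```

A ρ near 1 means the proxy orders datasets almost exactly as their true task performance would, which is what makes annotation-free selection viable.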