Improving Data Efficiency via Curating LLM-Driven Rating Systems

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language model (LLM)-driven data quality scoring systems suffer from systematic biases and estimation errors, leading to inefficient data selection in instruction tuning. Method: We propose DS2, a data subset optimization framework that (1) models LLM scoring error patterns and constructs a score transition matrix to calibrate systematic bias, and (2) incorporates diversity-aware sampling constraints to mitigate overrepresentation of high-scoring yet homogeneous samples. Contribution/Results: DS2 challenges the conventional "more data is better" assumption, enabling efficient and robust subset selection. Experiments show that DS2 achieves superior alignment performance across multiple benchmarks using only 3.3% (1K samples) of the original 300K-sample dataset, outperforming training on the full dataset and matching or exceeding the human-curated LIMA dataset at the same sample size. This demonstrates substantial gains in both data efficiency and alignment capability.
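The score-calibration step can be illustrated in the spirit of label-noise correction. The sketch below is a hypothetical minimal implementation, not the paper's code: it assumes a row-stochastic transition matrix `T` where `T[i, j]` approximates the probability that a sample with true score `i` receives observed score `j`, estimates the true-score prior from the empirical observed-score distribution, and applies Bayes' rule to get a per-sample posterior over true scores.

```python
import numpy as np

def calibrate_scores(observed_scores, T):
    """Return the posterior P(true score | observed score) for each sample.

    observed_scores : (n,) int array of LLM-assigned scores in {0..K-1}
    T               : (K, K) row-stochastic score transition matrix,
                      T[i, j] = P(observed = j | true = i)
    """
    K = T.shape[0]
    # Empirical distribution of the observed LLM scores.
    obs_dist = np.bincount(observed_scores, minlength=K) / len(observed_scores)
    # Estimate the true-score prior by solving prior @ T = obs_dist,
    # then project back onto the probability simplex.
    prior, *_ = np.linalg.lstsq(T.T, obs_dist, rcond=None)
    prior = np.clip(prior, 0.0, None)
    prior /= prior.sum()
    # Bayes' rule: P(true = i | observed = j) is proportional to prior[i] * T[i, j].
    posterior = prior[:, None] * T                   # rows = true, cols = observed
    posterior /= posterior.sum(axis=0, keepdims=True)
    return posterior[:, observed_scores].T           # (n, K) per-sample posterior

# Toy example with 3 score levels and a rater that inflates scores upward.
T = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.7, 0.2],
              [0.0, 0.1, 0.9]])
scores = np.array([2, 2, 1, 0, 2])
post = calibrate_scores(scores, T)
```

Samples would then be re-ranked by their posterior probability of holding a high true score rather than by the raw LLM rating.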

📝 Abstract
Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce DS2, a Diversity-aware Score curation method for Data Selection. By systematically modeling error patterns through a score transition matrix, DS2 corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that "more can be less."
Problem

Research questions and friction points this paper is trying to address.

Improves data efficiency by curating LLM-driven rating systems.
Addresses inaccuracies and biases in LLM-based data quality ratings.
Challenges traditional data scaling laws with smaller, high-quality datasets.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diversity-aware Score curation method DS2
Systematic error modeling via score transition matrix
Curated subset outperforms full-scale datasets
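The diversity-aware side of the selection can be sketched as a greedy procedure: take candidates in order of their (calibrated) score, but skip any sample whose embedding is too similar to one already chosen. This is an illustrative sketch under assumed inputs, not the paper's algorithm; the function and parameter names (`select_diverse_subset`, `sim_threshold`) are hypothetical.

```python
import numpy as np

def select_diverse_subset(embeddings, scores, k, sim_threshold=0.9):
    """Greedily pick up to k samples, highest score first, enforcing diversity.

    A candidate is skipped when its cosine similarity to any already
    selected sample exceeds sim_threshold (i.e., it is near-redundant).
    """
    # Normalize rows so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    order = np.argsort(scores)[::-1]             # best-scoring candidates first
    selected = []
    for idx in order:
        if len(selected) == k:
            break
        if selected:
            sims = emb[selected] @ emb[idx]      # similarity to the selected set
            if sims.max() > sim_threshold:
                continue                         # too similar: skip redundancy
        selected.append(int(idx))
    return selected

# Toy example: two near-duplicate high scorers; only one should survive.
emb = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [0.7, 0.7]])
scores = np.array([0.9, 0.95, 0.5, 0.6])
picked = select_diverse_subset(emb, scores, k=3, sim_threshold=0.95)
```

In the toy run, the top scorer (index 1) is kept while its near-duplicate (index 0) is rejected, so lower-scoring but dissimilar samples fill the budget, which is the overrepresentation effect the summary describes.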