Improving Data Efficiency via Curating LLM-Driven Rating Systems

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language model (LLM)-driven data quality scoring systems suffer from systematic biases and estimation errors, leading to inefficient data selection in instruction tuning. Method: We propose DS2, a data subset optimization framework that (1) models LLM scoring error patterns and constructs a score transition matrix to calibrate systematic bias, and (2) incorporates diversity-aware sampling constraints to mitigate overrepresentation of high-scoring yet homogeneous samples. Contribution/Results: DS2 challenges the conventional "more data is better" assumption, enabling efficient and robust subset selection. Experiments show that DS2 achieves superior alignment performance across multiple benchmarks using only 3.3% (1K samples) of the original 300K-sample dataset, outperforming training on the full dataset and matching or exceeding the human-curated LIMA dataset at the same sample size. This demonstrates substantial gains in both data efficiency and alignment capability.
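The score-calibration step can be illustrated in the spirit of label-noise correction. The sketch below is a hypothetical minimal implementation, not the paper's code: it assumes a row-stochastic transition matrix `T` where `T[i, j]` approximates the probability that a sample with true score `i` receives observed score `j`, estimates the true-score prior from the empirical observed-score distribution, and applies Bayes' rule to get a per-sample posterior over true scores.

```python
import numpy as np

def calibrate_scores(observed_scores, T):
    """Return the posterior P(true score | observed score) for each sample.

    observed_scores : (n,) int array of LLM-assigned scores in {0..K-1}
    T               : (K, K) row-stochastic score transition matrix,
                      T[i, j] = P(observed = j | true = i)
    """
    K = T.shape[0]
    # Empirical distribution of the observed LLM scores.
    obs_dist = np.bincount(observed_scores, minlength=K) / len(observed_scores)
    # Estimate the true-score prior by solving prior @ T = obs_dist,
    # then project back onto the probability simplex.
    prior, *_ = np.linalg.lstsq(T.T, obs_dist, rcond=None)
    prior = np.clip(prior, 0.0, None)
    prior /= prior.sum()
    # Bayes' rule: P(true = i | observed = j) is proportional to prior[i] * T[i, j].
    posterior = prior[:, None] * T                   # rows = true, cols = observed
    posterior /= posterior.sum(axis=0, keepdims=True)
    return posterior[:, observed_scores].T           # (n, K) per-sample posterior

# Toy example with 3 score levels and a rater that inflates scores upward.
T = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.7, 0.2],
              [0.0, 0.1, 0.9]])
scores = np.array([2, 2, 1, 0, 2])
post = calibrate_scores(scores, T)
```

Samples would then be re-ranked by their posterior probability of holding a high true score rather than by the raw LLM rating.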

📝 Abstract
Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce DS2, a Diversity-aware Score curation method for Data Selection. By systematically modeling error patterns through a score transition matrix, DS2 corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that "more can be less."
Problem

Research questions and friction points this paper is trying to address.

Improves data efficiency by curating LLM-driven rating systems.
Addresses inaccuracies and biases in LLM-based data quality ratings.
Challenges traditional data scaling laws with smaller, high-quality datasets.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diversity-aware Score curation method DS2
Systematic error modeling via score transition matrix
Curated subset outperforms full-scale datasets
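The diversity-aware side of the selection can be sketched as a greedy procedure: take candidates in order of their (calibrated) score, but skip any sample whose embedding is too similar to one already chosen. This is an illustrative sketch under assumed inputs, not the paper's algorithm; the function and parameter names (`select_diverse_subset`, `sim_threshold`) are hypothetical.

```python
import numpy as np

def select_diverse_subset(embeddings, scores, k, sim_threshold=0.9):
    """Greedily pick up to k samples, highest score first, enforcing diversity.

    A candidate is skipped when its cosine similarity to any already
    selected sample exceeds sim_threshold (i.e., it is near-redundant).
    """
    # Normalize rows so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    order = np.argsort(scores)[::-1]             # best-scoring candidates first
    selected = []
    for idx in order:
        if len(selected) == k:
            break
        if selected:
            sims = emb[selected] @ emb[idx]      # similarity to the selected set
            if sims.max() > sim_threshold:
                continue                         # too similar: skip redundancy
        selected.append(int(idx))
    return selected

# Toy example: two near-duplicate high scorers; only one should survive.
emb = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [0.7, 0.7]])
scores = np.array([0.9, 0.95, 0.5, 0.6])
picked = select_diverse_subset(emb, scores, k=3, sim_threshold=0.95)
```

In the toy run, the top scorer (index 1) is kept while its near-duplicate (index 0) is rejected, so lower-scoring but dissimilar samples fill the budget, which is the overrepresentation effect the summary describes.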