Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

📅 2024-10-28
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address performance degradation in text classification caused by distributional shift in synthetic data generated by large language models (LLMs), this paper proposes a lightweight weighted loss framework. Leveraging only a small number of real labeled samples, the method automatically evaluates synthetic instances along two dimensions—quality and diversity—and dynamically assigns learning weights to them during training of BERT-based classifiers. Crucially, it is the first approach to jointly couple data weighting with alignment of the synthetic distribution induced by the LLM, without requiring generator fine-tuning, additional human annotation, or architectural modifications. Extensive experiments across multiple text classification benchmarks demonstrate an average accuracy improvement of 2.3% over standard cross-entropy loss and state-of-the-art weighting strategies. The method exhibits strong generalization and is agnostic to the choice of LLM generator.
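The core idea described above — scaling each synthetic example's loss by a per-sample weight before averaging — can be sketched as a weighted cross-entropy. This is a minimal illustrative implementation, not the paper's exact formulation; the function and variable names are assumptions.

```python
import math

def weighted_cross_entropy(probs, labels, weights):
    """Per-sample weighted cross-entropy: each synthetic example's
    negative log-likelihood is scaled by its assigned weight,
    then normalized by the total weight (illustrative sketch)."""
    total = sum(w * -math.log(p[y]) for p, y, w in zip(probs, labels, weights))
    return total / sum(weights)

# Toy batch of two synthetic examples; the second is judged lower
# quality, so it contributes less to the training signal.
probs = [[0.9, 0.1], [0.6, 0.4]]   # predicted class probabilities
labels = [0, 0]                    # gold class indices
weights = [1.0, 0.3]               # hypothetical per-sample weights
loss = weighted_cross_entropy(probs, labels, weights)
```

With uniform weights this reduces to standard cross-entropy, which is why the framework needs no architectural changes to the BERT-based classifier.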

📝 Abstract
Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can degrade outcomes when the trained model is applied in practice. Therefore, we propose efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diversified data generated by LLMs, using only a small amount of real-world data. We empirically assessed the effectiveness of our method on multiple text classification tasks, and the results showed that applying our approaches to a BERT-level model robustly outperformed standard cross-entropy and other data weighting approaches, providing a potential solution for effectively leveraging synthetic data from any suitable data generator for model training.
Problem

Research questions and friction points this paper is trying to address.

Align synthetic LLM data with real-world distribution
Improve text classification using weighted-loss approaches
Enhance model performance with limited real-world data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weighted-loss aligns synthetic with real data
Emphasizes high-quality diversified LLM-generated data
BERT-level model outperforms standard approaches
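The two scoring dimensions named above — quality (agreement with the real-data distribution) and diversity (dissimilarity to other synthetic samples) — could be combined into a single training weight, for instance by a linear mix. The function, the linear combination, and the `alpha` parameter below are illustrative assumptions, not the paper's formulation.

```python
def assign_weight(quality, diversity, alpha=0.5):
    """Combine a quality score in [0, 1] (estimated from a small set
    of real labeled samples) and a diversity score in [0, 1] into one
    per-sample weight. Linear mixing with `alpha` is a hypothetical
    choice used only to illustrate the idea."""
    return alpha * quality + (1 - alpha) * diversity

w_diverse   = assign_weight(0.9, 0.8)  # high quality, diverse
w_redundant = assign_weight(0.9, 0.1)  # high quality but near-duplicate
```

Under this sketch, a near-duplicate synthetic example receives a smaller weight even when its quality is high, which is how diversity weighting discourages the classifier from overfitting to repetitive LLM generations.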
Hsun-Yu Kuo
Data Science Degree Program, National Taiwan University and Academia Sinica; Swiss Federal Institute of Technology in Lausanne (EPFL)
Yin-Hsiang Liao
Data Science Degree Program, National Taiwan University and Academia Sinica
Yu-Chieh Chao
University of California, Davis
Wei-Yun Ma
Data Science Degree Program, National Taiwan University and Academia Sinica
Pu-Jen Cheng
Assistant Professor of Computer Science and Information Engineering, National Taiwan University
Information Retrieval · Web Mining