Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

📅 2024-10-28
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address performance degradation in text classification caused by distributional shift in synthetic data generated by large language models (LLMs), this paper proposes a lightweight weighted loss framework. Leveraging only a small number of real labeled samples, the method automatically evaluates synthetic instances along two dimensions—quality and diversity—and dynamically assigns learning weights to them during training of BERT-based classifiers. Crucially, it is the first approach to jointly couple data weighting with alignment of the synthetic distribution induced by the LLM, without requiring generator fine-tuning, additional human annotation, or architectural modifications. Extensive experiments across multiple text classification benchmarks demonstrate an average accuracy improvement of 2.3% over standard cross-entropy loss and state-of-the-art weighting strategies. The method exhibits strong generalization and is agnostic to the choice of LLM generator.
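The core idea described above — scaling each synthetic example's loss by a per-sample weight before averaging — can be sketched as a weighted cross-entropy. This is a minimal illustrative implementation, not the paper's exact formulation; the function and variable names are assumptions.

```python
import math

def weighted_cross_entropy(probs, labels, weights):
    """Per-sample weighted cross-entropy: each synthetic example's
    negative log-likelihood is scaled by its assigned weight,
    then normalized by the total weight (illustrative sketch)."""
    total = sum(w * -math.log(p[y]) for p, y, w in zip(probs, labels, weights))
    return total / sum(weights)

# Toy batch of two synthetic examples; the second is judged lower
# quality, so it contributes less to the training signal.
probs = [[0.9, 0.1], [0.6, 0.4]]   # predicted class probabilities
labels = [0, 0]                    # gold class indices
weights = [1.0, 0.3]               # hypothetical per-sample weights
loss = weighted_cross_entropy(probs, labels, weights)
```

With uniform weights this reduces to standard cross-entropy, which is why the framework needs no architectural changes to the BERT-based classifier.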

📝 Abstract
Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can degrade outcomes when the trained model is applied in practice. Therefore, we propose efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diversified data generated by LLMs, using only a small amount of real-world data. We empirically assessed the effectiveness of our method on multiple text classification tasks, and the results showed that applying our approaches to a BERT-level model robustly outperformed standard cross-entropy and other data weighting approaches, providing a potential solution for effectively leveraging synthetic data from any suitable data generator for model training.
Problem

Research questions and friction points this paper is trying to address.

Align synthetic LLM data with real-world distribution
Improve text classification using weighted-loss approaches
Enhance model performance with limited real-world data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weighted-loss aligns synthetic with real data
Emphasizes high-quality diversified LLM-generated data
BERT-level model outperforms standard approaches
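The two scoring dimensions named above — quality (agreement with the real-data distribution) and diversity (dissimilarity to other synthetic samples) — could be combined into a single training weight, for instance by a linear mix. The function, the linear combination, and the `alpha` parameter below are illustrative assumptions, not the paper's formulation.

```python
def assign_weight(quality, diversity, alpha=0.5):
    """Combine a quality score in [0, 1] (estimated from a small set
    of real labeled samples) and a diversity score in [0, 1] into one
    per-sample weight. Linear mixing with `alpha` is a hypothetical
    choice used only to illustrate the idea."""
    return alpha * quality + (1 - alpha) * diversity

w_diverse   = assign_weight(0.9, 0.8)  # high quality, diverse
w_redundant = assign_weight(0.9, 0.1)  # high quality but near-duplicate
```

Under this sketch, a near-duplicate synthetic example receives a smaller weight even when its quality is high, which is how diversity weighting discourages the classifier from overfitting to repetitive LLM generations.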
Hsun-Yu Kuo
Data Science Degree Program, National Taiwan University and Academia Sinica; Swiss Federal Institute of Technology in Lausanne (EPFL)
Yin-Hsiang Liao
Data Science Degree Program, National Taiwan University and Academia Sinica
Yu-Chieh Chao
University of California, Davis
Wei-Yun Ma
Data Science Degree Program, National Taiwan University and Academia Sinica
Pu-Jen Cheng
Assistant Professor of Computer Science and Information Engineering, National Taiwan University
Information Retrieval · Web Mining