🤖 AI Summary
To address performance degradation in text classification caused by distributional shift in synthetic data generated by large language models (LLMs), this paper proposes a lightweight weighted loss framework. Leveraging only a small number of real labeled samples, the method automatically evaluates synthetic instances along two dimensions—quality and diversity—and dynamically assigns learning weights to them during training of BERT-based classifiers. Crucially, it is the first approach to jointly couple data weighting with alignment of the synthetic distribution induced by the LLM, without requiring generator fine-tuning, additional human annotation, or architectural modifications. Extensive experiments across multiple text classification benchmarks demonstrate an average accuracy improvement of 2.3% over standard cross-entropy loss and state-of-the-art weighting strategies. The method exhibits strong generalization and is agnostic to the choice of LLM generator.
📝 Abstract
Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data are scarce. However, the generated data can deviate from the real-world distribution, and this misalignment can degrade performance when the trained model is deployed in applications. We therefore propose efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diversified data generated by LLMs, using only a small amount of real-world data. We empirically assessed the effectiveness of our method on multiple text classification tasks; applying our approaches to a BERT-level model robustly outperformed standard cross-entropy and other data-weighting approaches, offering a potential solution for effectively leveraging synthetic data from any suitable data generator in model training.
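The core mechanism described above can be sketched as a per-sample weighted cross-entropy: each synthetic example receives a weight derived from quality and diversity scores, and that weight scales its contribution to the loss. The sketch below is a minimal, hypothetical illustration in plain Python; the convex quality/diversity combination and the normalization by the weight sum are assumptions for clarity, not the paper's exact formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_weight(quality, diversity, alpha=0.5):
    """Hypothetical weighting: a convex mix of quality and diversity
    scores, both assumed to lie in [0, 1]. The actual scheme in the
    paper may combine these signals differently."""
    return alpha * quality + (1 - alpha) * diversity

def weighted_cross_entropy(batch_logits, labels, weights):
    """Weighted cross-entropy over a batch: each example's negative
    log-likelihood is scaled by its weight, then averaged by the
    total weight so uniform weights recover the plain mean loss."""
    total = 0.0
    for logits, y, w in zip(batch_logits, labels, weights):
        probs = softmax(logits)
        total += -w * math.log(probs[y])
    return total / sum(weights)
```

With uniform weights this reduces to ordinary cross-entropy, so the standard loss is a special case; down-weighting low-quality or redundant synthetic samples then shifts the effective training distribution toward the real one without changing the model architecture or the generator.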