🤖 AI Summary
To address the weak generalization and severe prediction bias of large language models (LLMs) on few-shot and class-imbalanced text data, this paper proposes an embedding-space synthetic feature augmentation method. Unlike conventional approaches, it operates directly in the language model's latent embedding space—bypassing raw text generation—and synthesizes minority-class features via embedding interpolation, noise perturbation, and adversarial generation to improve the semantic representation distribution. The method integrates seamlessly into standard fine-tuning pipelines and is evaluated on mainstream open-source text classification benchmarks. Experiments across multiple benchmarks demonstrate up to a 12.3% improvement in minority-class F1 score, alongside consistent gains in overall accuracy and robustness. The core innovation lies in moving synthetic data generation from the input space to the embedding space, enabling efficient and fair representation calibration.
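To make the core idea concrete, here is a minimal sketch of two of the named techniques—embedding interpolation (mixup-style) and noise perturbation—applied to minority-class embeddings. The function name, parameters, and defaults are illustrative assumptions, not the paper's actual implementation; the adversarial-generation component is omitted for brevity.

```python
import numpy as np

def augment_minority(embeddings: np.ndarray, n_new: int,
                     alpha: float = 0.4, noise_std: float = 0.01,
                     seed: int = 0) -> np.ndarray:
    """Synthesize n_new embeddings for a minority class by interpolating
    random pairs of its existing embeddings (mixup-style) and adding
    small Gaussian noise. Hypothetical sketch, not the paper's code."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    # Pick random pairs of existing minority-class samples.
    i = rng.integers(0, n, size=n_new)
    j = rng.integers(0, n, size=n_new)
    # Beta-distributed interpolation weights, broadcast over the feature dim.
    lam = rng.beta(alpha, alpha, size=(n_new, 1))
    synthetic = lam * embeddings[i] + (1 - lam) * embeddings[j]
    # Noise perturbation spreads the synthetic points around the line segments.
    synthetic += rng.normal(0.0, noise_std, size=synthetic.shape)
    return synthetic

# Example: upsample a 5-sample minority class to 20 samples in a 768-dim space.
minority = np.random.default_rng(1).normal(size=(5, 768))
new_feats = augment_minority(minority, n_new=15)
balanced = np.vstack([minority, new_feats])  # shape (20, 768)
```

Because the synthetic points live in the embedding space, they can be fed straight into the classification head during fine-tuning, with no text decoding step.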
📝 Abstract
Training and fine-tuning deep learning models, especially large language models (LLMs), on limited and imbalanced datasets poses substantial challenges. These issues often result in poor generalization, where models overfit to dominant classes and underperform on minority classes, leading to biased predictions and reduced robustness in real-world applications. To overcome these challenges, we propose augmenting features in the embedding space by generating synthetic samples with techniques such as embedding interpolation, noise perturbation, and adversarial generation. By upsampling underrepresented classes, this method improves model performance and alleviates data imbalance. We validate the effectiveness of this approach across multiple open-source text classification benchmarks, demonstrating its potential to enhance model robustness and generalization in imbalanced data scenarios.