🤖 AI Summary
To address the weak generalization and severe prediction bias of large language models (LLMs) on few-shot and class-imbalanced text data, this paper proposes an embedding-space synthetic feature augmentation method. Unlike conventional approaches, it operates directly in the language model's latent embedding space—bypassing raw text generation—and synthesizes minority-class features via embedding interpolation, noise perturbation, and adversarial generation to improve the semantic representation distribution. The method integrates seamlessly into standard fine-tuning pipelines and is evaluated on mainstream open-source text classification benchmarks. Experiments across multiple benchmarks demonstrate up to a 12.3% improvement in minority-class F1 score, alongside consistent gains in overall accuracy and robustness. The core innovation lies in moving synthetic data generation from the input space to the embedding space, enabling efficient and fair representation calibration.
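To make the core idea concrete, here is a minimal sketch of two of the named techniques—embedding interpolation (mixup-style) and noise perturbation—applied to minority-class embeddings. The function name, parameters, and defaults are illustrative assumptions, not the paper's actual implementation; the adversarial-generation component is omitted for brevity.

```python
import numpy as np

def augment_minority(embeddings: np.ndarray, n_new: int,
                     alpha: float = 0.4, noise_std: float = 0.01,
                     seed: int = 0) -> np.ndarray:
    """Synthesize n_new embeddings for a minority class by interpolating
    random pairs of its existing embeddings (mixup-style) and adding
    small Gaussian noise. Hypothetical sketch, not the paper's code."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    # Pick random pairs of existing minority-class samples.
    i = rng.integers(0, n, size=n_new)
    j = rng.integers(0, n, size=n_new)
    # Beta-distributed interpolation weights, broadcast over the feature dim.
    lam = rng.beta(alpha, alpha, size=(n_new, 1))
    synthetic = lam * embeddings[i] + (1 - lam) * embeddings[j]
    # Noise perturbation spreads the synthetic points around the line segments.
    synthetic += rng.normal(0.0, noise_std, size=synthetic.shape)
    return synthetic

# Example: upsample a 5-sample minority class to 20 samples in a 768-dim space.
minority = np.random.default_rng(1).normal(size=(5, 768))
new_feats = augment_minority(minority, n_new=15)
balanced = np.vstack([minority, new_feats])  # shape (20, 768)
```

Because the synthetic points live in the embedding space, they can be fed straight into the classification head during fine-tuning, with no text decoding step.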
📝 Abstract
Training and fine-tuning deep learning models, especially large language models (LLMs), on limited and imbalanced datasets poses substantial challenges. These issues often result in poor generalization, where models overfit to dominant classes and underperform on minority classes, leading to biased predictions and reduced robustness in real-world applications. To overcome these challenges, we propose augmenting features in the embedding space by generating synthetic samples with techniques such as embedding interpolation, noise perturbation, and adversarial generation. By upsampling underrepresented classes, this method improves model performance and alleviates data imbalance. We validate the effectiveness of this approach across multiple open-source text classification benchmarks, demonstrating its potential to enhance model robustness and generalization in imbalanced data scenarios.