🤖 AI Summary
This paper addresses the performance degradation of in-context learning (ICL) in large language models caused by class-imbalanced (long-tailed) label distributions in the annotated datasets used for demonstration selection. It shows that conventional resampling and reweighting techniques, though effective in standard supervised learning, fail in the ICL setting because ICL involves no explicit parameter updates or gradient-based optimization. To address this, the authors propose a dual-weight correction mechanism: (1) it decomposes the distribution shift between the annotated and test datasets into class-wise weights and a conditional bias; (2) it estimates the conditional bias by minimizing empirical error on a balanced validation set; and (3) it uses the resulting two-component weights to adjust the original demonstration-selection scores, operating orthogonally to existing selection methods without architectural modification. Evaluated on multiple long-tailed ICL benchmarks, the method improves average accuracy by up to 5.46%, significantly outperforming prior rebalancing approaches.
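The two-component re-scoring step might look like the following minimal sketch. The linear form `w_c * score + b_c`, the function name, and all variable names are illustrative assumptions; the summary only states that class-wise weights and a conditional bias modify the original selection scores.

```python
import numpy as np

def weighted_selection(scores, labels, class_weights, cond_bias, k):
    """Hypothetical two-component weighted scoring for demonstration selection.

    scores:        base scores from an existing selector, shape (n,)
    labels:        class label of each annotated candidate, shape (n,)
    class_weights: per-class weight w_c correcting for class imbalance
    cond_bias:     per-class conditional bias b_c (e.g. estimated on a
                   balanced validation set)
    k:             number of demonstrations to select
    """
    # Re-score each candidate with its class's weight and bias,
    # then keep the k candidates with the highest adjusted scores.
    adjusted = class_weights[labels] * scores + cond_bias[labels]
    return np.argsort(-adjusted)[:k]

scores = np.array([0.9, 0.8, 0.2, 0.1])   # raw scores favor class 0
labels = np.array([0, 0, 1, 1])
picked = weighted_selection(scores, labels,
                            class_weights=np.array([0.5, 2.0]),
                            cond_bias=np.array([0.0, 0.1]), k=2)
print(picked)  # one candidate per class instead of two from class 0
```

Note the intended effect: unweighted top-2 selection would take both class-0 examples, whereas the adjusted scores spread the picks across classes while still ranking by the original scorer within each class.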
📝 Abstract
Large language models (LLMs) have shown impressive performance on downstream tasks through in-context learning (ICL), which relies heavily on the demonstrations selected from annotated datasets. Existing selection methods can depend strongly on the distribution of the annotated dataset, which is often long-tailed in real-world scenarios. In this work, we show that imbalanced class distributions in annotated datasets significantly degrade the performance of ICL across various tasks and selection methods. Moreover, traditional rebalancing methods fail to ameliorate the issue of class imbalance in ICL. Our method is motivated by decomposing the distributional differences between annotated and test datasets into two-component weights: class-wise weights and conditional bias. The key idea behind our method is to estimate the conditional bias by minimizing the empirical error on a balanced validation dataset and to employ the two-component weights to modify the original scoring functions during selection. Our approach prevents selecting too many demonstrations from a single class while preserving the effectiveness of the original selection methods. Extensive experiments demonstrate the effectiveness of our method, improving the average accuracy by up to 5.46% on common benchmarks with imbalanced datasets.
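The abstract's other key step, estimating the conditional bias by minimizing empirical error on a balanced validation set, might be sketched as a simple per-class grid search. This is an assumed implementation, not the paper's algorithm: the search procedure, candidate grid, and all names are illustrative.

```python
import numpy as np

def estimate_conditional_bias(val_scores, val_labels, n_classes,
                              candidates=np.linspace(-1.0, 1.0, 21)):
    """Hypothetical sketch: choose a per-class bias b_c that minimizes
    empirical classification error on a class-balanced validation set.

    val_scores: (n, n_classes) per-class scores for each validation example
    val_labels: (n,) ground-truth labels
    """
    bias = np.zeros(n_classes)
    for c in range(n_classes):
        errors = []
        for b in candidates:
            # Shift only class c's score and measure validation error.
            shifted = val_scores.copy()
            shifted[:, c] += b
            preds = shifted.argmax(axis=1)
            errors.append(np.mean(preds != val_labels))
        # Keep the bias value with the lowest empirical error.
        bias[c] = candidates[int(np.argmin(errors))]
    return bias

# Toy example: class 0 is systematically over-scored.
val_scores = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.55, 0.45]])
val_labels = np.array([0, 0, 1, 1])
bias = estimate_conditional_bias(val_scores, val_labels, n_classes=2)
print(bias)  # negative bias for class 0, positive for class 1
```

A coordinate-wise grid search like this is crude but illustrates the principle: because the validation set is balanced, the error being minimized is not dominated by head classes, so the learned bias counteracts the imbalance in the annotated pool.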