🤖 AI Summary
Prior work on in-context learning (ICL) largely assumes that example selection matters far more than example ordering, and so treats ordering as negligible. This study challenges that assumption by systematically comparing how example order and example choice affect large language model (LLM) performance. Method: Through controlled experiments across classification and generation tasks, we evaluate open-source models (0.5B–27B parameters) and GPT-5, isolating the impact of permutation while holding the example set constant. Contribution/Results: We find that reordering examples induces performance variance comparable in magnitude to replacing the entire example set, which suggests that ordering matters about as much as selection. We also show that strong orderings can be identified using only a development set, achieving performance close to an oracle that picks the best ordering using test labels. These findings argue for treating example selection and ordering as jointly important in prompt design and for re-examining the assumptions commonly held about ICL.
📝 Abstract
In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.
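The two measurements described in the abstract can be illustrated with a minimal sketch: how much a score varies across orderings of one fixed example set, and how an ordering might be chosen using only a development set. This is not the authors' implementation. The `Example` representation, the `score_fn` callable (assumed to build a prompt from the ordered demonstrations, query a model, and return a metric such as accuracy), the sampling budget `k`, and the use of standard deviation as the spread statistic are all assumptions made for illustration.

```python
import itertools
import math
import random
import statistics
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation of a demonstration: (input text, label).
Example = Tuple[str, str]
# Hypothetical scorer: takes ordered demos and an evaluation set, returns a
# metric such as accuracy. The model call itself is outside this sketch.
ScoreFn = Callable[[Sequence[Example], Sequence[Example]], float]


def sample_orderings(examples: Sequence, k: int, rng: random.Random) -> List[list]:
    """Return up to k distinct orderings (every ordering if n! <= k)."""
    n = len(examples)
    if math.factorial(n) <= k:
        return [list(p) for p in itertools.permutations(examples)]
    seen = set()
    while len(seen) < k:
        idx = list(range(n))
        rng.shuffle(idx)
        seen.add(tuple(idx))
    # Sort the index tuples so the result is deterministic for a given seed.
    return [[examples[i] for i in order] for order in sorted(seen)]


def ordering_spread(examples, eval_set, score_fn: ScoreFn, k: int = 24, seed: int = 0) -> float:
    """Standard deviation of the score across orderings of one fixed example set."""
    orderings = sample_orderings(examples, k, random.Random(seed))
    return statistics.pstdev(score_fn(o, eval_set) for o in orderings)


def pick_ordering_on_dev(examples, dev_set, score_fn: ScoreFn, k: int = 24, seed: int = 0) -> list:
    """Return the sampled ordering with the highest score on the development set."""
    orderings = sample_orderings(examples, k, random.Random(seed))
    return max(orderings, key=lambda o: score_fn(o, dev_set))
```

Under these assumptions, the selection-side spread would be computed analogously by scoring several different example sets in a fixed order, and the ordering returned by `pick_ordering_on_dev` would then be evaluated once on held-out test data to compare it against the test-label oracle.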