🤖 AI Summary
This work investigates the effectiveness and limits of many-shot in-context learning (ICL) as a means of evaluating long-context language models (LCLMs). Addressing two key questions, namely which tasks benefit from additional in-context examples and to what extent tasks require retrieval versus global context understanding, the authors propose a two-way categorization of ICL tasks: similar-sample learning (SSL), where retrieving the most similar examples suffices for good performance, and all-sample learning (ASL), which requires deeper comprehension of all examples in the prompt. They also introduce MANYICLBENCH, a many-shot ICL benchmark designed to characterize LCLM capability on both fronts, and use it to evaluate 12 LCLMs. The results show that state-of-the-art models remain robust on SSL tasks up to 64k tokens, whereas many models degrade significantly at only 16k tokens on ASL tasks, offering a fine-grained lens for disentangling retrieval from global-comprehension abilities in long-context evaluation.
📝 Abstract
Many-shot in-context learning (ICL) has emerged as a unique setup to both utilize and test the ability of large language models to handle long context. This paper delves into long-context language model (LCLM) evaluation through many-shot ICL. We first ask: which types of ICL tasks benefit from additional demonstrations, and how effective are they for evaluating LCLMs? We find that classification and summarization tasks improve with additional demonstrations, while translation and reasoning tasks do not exhibit clear trends. Next, we investigate the extent to which different tasks require retrieval versus global context understanding. We develop metrics to categorize ICL tasks into two groups: (i) similar-sample learning (SSL): tasks where retrieval of the most similar examples is sufficient for good performance, and (ii) all-sample learning (ASL): tasks that necessitate a deeper comprehension of all examples in the prompt. Lastly, we introduce a new many-shot ICL benchmark, MANYICLBENCH, to characterize models' abilities on both fronts, and use it to benchmark 12 LCLMs. We find that while state-of-the-art models perform well up to 64k tokens on SSL tasks, many models experience significant performance drops at only 16k tokens on ASL tasks.