LongIns: A Challenging Long-context Instruction-based Exam for LLMs

📅 2024-06-25
🏛️ arXiv.org
📈 Citations: 5 (influential: 0)
📄 PDF
🤖 AI Summary
Existing long-context evaluation benchmarks focus predominantly on information retrieval, neglecting instruction understanding and multi-step reasoning, and they do not systematically assess the context length a model actually supports. Method: We introduce LongIns, a benchmark built from existing instruction datasets and designed to evaluate long-context instruction understanding and multi-hop reasoning. It defines three evaluation settings: GIST (Global Instruction & Single Task), LIST (Local Instruction & Single Task), and LIMT (Local Instruction & Multiple Tasks). Contribution/Results: Experiments reveal severe performance degradation in state-of-the-art LLMs; the top-performing GPT-4 with a 128k context window performs poorly at a 16k evaluation window, and many models still struggle with multi-hop reasoning even under short (<4k) context windows. LongIns moves beyond retrieval-centric evaluation and offers a more comprehensive assessment of long-context capability.

📝 Abstract
The long-context capabilities of large language models (LLMs) have been a hot topic in recent years. To evaluate the performance of LLMs in different scenarios, various assessment benchmarks have emerged. However, since most of these benchmarks focus on identifying key information to answer questions, which mainly requires the retrieval ability of LLMs, they can only partially represent the reasoning performance of LLMs over large amounts of information. Meanwhile, although LLMs often claim context windows of 32k, 128k, 200k, or even longer, these benchmarks fail to reveal the actual supported length of these LLMs. To address these issues, we propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs, built on existing instruction datasets. Specifically, LongIns introduces three evaluation settings: Global Instruction & Single Task (GIST), Local Instruction & Single Task (LIST), and Local Instruction & Multiple Tasks (LIMT). Based on LongIns, we perform comprehensive evaluations of existing LLMs and report the following important findings: (1) the top-performing GPT-4 with a 128k context length performs poorly at the 16k evaluation context window of LongIns; (2) for the multi-hop reasoning ability of many existing LLMs, significant efforts are still needed even under short context windows (less than 4k).
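
To make the three settings concrete, here is a minimal sketch of how exam prompts could be assembled from instruction-dataset examples. The `TaskExample` type, the function names, and the plain-text prompt layout are hypothetical illustrations; the paper's actual prompt construction may differ.

```python
# A minimal sketch (not the authors' implementation) of assembling
# LongIns-style exam prompts under the three evaluation settings.
from dataclasses import dataclass

@dataclass
class TaskExample:
    instruction: str  # task definition from a source instruction dataset
    text: str         # the input the instruction applies to
    answer: str       # gold answer, used when scoring the exam

def build_gist(instruction: str, examples: list[TaskExample]) -> str:
    """GIST (Global Instruction & Single Task): one instruction stated
    globally once, followed by many inputs for that single task."""
    inputs = "\n".join(f"({i}) {ex.text}" for i, ex in enumerate(examples, 1))
    return f"Instruction: {instruction}\n{inputs}"

def build_local(examples: list[TaskExample]) -> str:
    """LIST / LIMT: the instruction is restated locally before each input.
    If every example comes from the same task this is LIST; if the examples
    mix different tasks it is LIMT, so the model must track several task
    definitions spread across one long context."""
    return "\n\n".join(
        f"({i}) Instruction: {ex.instruction}\nInput: {ex.text}"
        for i, ex in enumerate(examples, 1)
    )
```

Packing more examples stretches the prompt toward a target evaluation window (e.g., 16k tokens), which is how a benchmark of this shape can probe the context length a model actually supports rather than the length it merely advertises.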
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' long-context reasoning beyond retrieval tasks
Assesses actual supported context length of LLMs
Tests multi-hop reasoning under short context windows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the LongIns benchmark for LLM evaluation
Defines three evaluation settings: GIST, LIST, and LIMT
Tests the multi-hop reasoning ability of LLMs
👥 Authors
Shawn Gavin (M-A-P)
Tuney Zheng (University of Waterloo)
Jiaheng Liu (M-A-P)
Quehry Que (M-A-P)
Noah Wang (University of Waterloo)
Jian Yang (M-A-P)
Chenchen Zhang (M-A-P)
Wenhao Huang (01.ai)
Wenhu Chen (Assistant Professor, University of Waterloo)
Ge Zhang (University of Waterloo)