Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks

📅 2024-10-16
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the weak instruction-following capability of large language models (LLMs) in knowledge-intensive tasks—e.g., failing to modify answers as instructed or being misled by irrelevant distractor instructions—by introducing the first verifiable instruction-following benchmark for knowledge tasks. Methodologically, it decouples instruction-following from factual knowledge by injecting conditional and distractor instructions into established knowledge benchmarks (e.g., MMLU, ARC); proposes an LLM-free automatic verification mechanism; designs two novel instruction paradigms—answer-dependent and option-space-driven; and conducts zero-shot evaluation across 25 open- and closed-source models spanning 1B to 405B parameters. Results reveal that even models extensively fine-tuned on instruction data frequently violate simple, unambiguous instructions. The project releases the benchmark dataset, evaluation code, and comprehensive results to support reproducible research.
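The core mechanism described above — appending an answer-dependent instruction to a multiple-choice question and verifying compliance without a judge LLM — can be sketched as follows. This is a minimal illustrative example, not the released benchmark code; the function names and the bracket-wrapping instruction are hypothetical stand-ins for the paper's instruction paradigms.

```python
# Hypothetical sketch: augment a multiple-choice question with an
# answer-dependent instruction, then verify the model's output with
# simple string rules -- no judge LLM needed.

def build_prompt(question, options, instruction):
    """Render an MCQ with an appended instruction (format is illustrative)."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in options.items()]
    lines.append(f"Instruction: {instruction}")
    return "\n".join(lines)

def verify(model_output, gold_label):
    """LLM-free check: did the model both answer the knowledge task
    correctly and follow the instruction to wrap its label in brackets?"""
    out = model_output.strip()
    followed = out.startswith("[") and out.endswith("]")
    answer = out.strip("[]")
    return {"instruction_followed": followed, "task_correct": answer == gold_label}

options = {"A": "Paris", "B": "Rome", "C": "Berlin", "D": "Madrid"}
prompt = build_prompt("What is the capital of France?", options,
                      "Reply with only the option label, wrapped in square brackets.")
print(verify("[A]", gold_label="A"))  # both checks pass
print(verify("A", gold_label="A"))    # correct answer, instruction violated
```

Because both checks are deterministic string operations, task performance and instruction-following can be scored independently, which is what lets the benchmark decouple knowledge from compliance.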

📝 Abstract
In this work, we focus our attention on developing a benchmark for instruction-following where it is easy to verify both task performance and instruction-following capabilities. We adapt existing knowledge benchmarks and augment them with instructions that a) are conditional on correctly answering the knowledge task or b) use the space of candidate options in multiple-choice knowledge-answering tasks. This allows us to study model characteristics, such as the change in performance on the knowledge tasks in the presence of answer-modifying instructions and distractor instructions. In contrast to existing benchmarks for instruction following, we not only measure instruction-following capabilities but also use LLM-free methods to study task performance. We study a series of openly available large language models of varying parameter sizes (1B-405B) and closed-source models, namely GPT-4o-mini and GPT-4o. We find that even large-scale instruction-tuned LLMs fail to follow simple instructions in zero-shot settings. We release our dataset, the benchmark, code, and results for future work.
Problem

Research questions and friction points this paper is trying to address.

Study interaction between knowledge and instruction following in LLMs
Evaluate LLMs' ability to follow simple answer-modifying instructions
Assess impact of irrelevant instructions on knowledge task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages multiple-choice knowledge benchmarks
Applies text and numeric manipulation instructions
Tests LLMs with distractor and list operations
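The option-space and list-operation instructions listed above can likewise be verified deterministically. The sketch below is a hypothetical illustration (the instruction wording and helper names are not from the paper): the model must list every option label except its answer, sorted alphabetically, and the check is pure string logic.

```python
# Hypothetical sketch of an option-space-driven instruction with an
# LLM-free check: the model must output every option label EXCEPT its
# answer, sorted alphabetically and comma-separated.

def expected_output(option_labels, gold_label):
    """The string a model emits if it both answers correctly and
    follows the list-operation instruction."""
    return ",".join(sorted(l for l in option_labels if l != gold_label))

def check(model_output, option_labels, gold_label):
    """Deterministic pass/fail: exact match against the expected string."""
    return model_output.strip() == expected_output(option_labels, gold_label)

labels = ["A", "B", "C", "D"]
# Gold answer is "C", so the compliant output is "A,B,D".
print(check("A,B,D", labels, gold_label="C"))  # True
print(check("C", labels, gold_label="C"))      # False: answered instead of following
```

Distractor instructions fit the same harness: an irrelevant directive is appended to the prompt, and the check simply confirms the model's output is unchanged from the no-distractor case.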