CFBench: A Comprehensive Constraints-Following Benchmark for LLMs

📅 2024-08-02
🏛️ arXiv.org
📈 Citations: 16 (influential: 3)
🤖 AI Summary
Existing constraint-following evaluation benchmarks for LLMs are fragmented and lack realism and systematicity from an end-user perspective. Method: We introduce CFBench, a large-scale, user-centric constraint-following benchmark comprising 1,000 diverse samples spanning 200+ real-world scenarios and 50+ NLP tasks. It proposes the first systematic taxonomy of constraints, with 10 primary categories and 25+ fine-grained subcategories, and designs a user-aware evaluation framework that combines multi-dimensional consistency assessment with requirement-priority modeling. Data construction draws on real user instructions, structured constraint injection, and human-AI collaborative verification. Contribution/Results: Experiments expose clear deficiencies in mainstream LLMs in format adherence, logical coherence, and safety compliance. All data, code, and the evaluation framework are publicly released to support standardized, rigorous assessment of constraint-following capability.
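The priority-aware evaluation described above can be pictured with a small scoring sketch. This is a minimal illustration, not the released evaluation code: the rate names CSR, ISR, and PSR follow the paper's satisfaction metrics, while the Constraint dataclass, the judge callback, and the aggregation details are assumptions made here for exposition.

```python
# Minimal sketch of priority-aware constraint scoring in the spirit of
# CFBench's satisfaction rates. The Constraint dataclass, judge callback,
# and aggregation are illustrative assumptions, not the released code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    description: str
    primary: bool = False  # True if this is a priority ("must-satisfy") requirement

def score_sample(response: str,
                 constraints: list[Constraint],
                 judge: Callable[[str, str], bool]) -> dict[str, float]:
    """Score one model response against its constraints.

    judge(response, constraint_description) -> bool stands in for the
    paper's multi-dimensional, human-AI collaborative evaluator.
    """
    verdicts = [judge(response, c.description) for c in constraints]
    csr = sum(verdicts) / len(verdicts)            # fraction of constraints met
    isr = float(all(verdicts))                     # 1.0 only if every constraint is met
    primary = [v for v, c in zip(verdicts, constraints) if c.primary]
    psr = float(all(primary)) if primary else isr  # priority requirements met
    return {"CSR": csr, "ISR": isr, "PSR": psr}

# Dataset-level scores would average these per-sample values over all 1,000 samples.
```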

📝 Abstract
The adeptness of Large Language Models (LLMs) in comprehending and following natural language instructions is critical for their deployment in sophisticated real-world applications. Existing evaluations mainly focus on fragmented constraints or narrow scenarios, but they overlook the comprehensiveness and authenticity of constraints from the user's perspective. To bridge this gap, we propose CFBench, a large-scale Comprehensive Constraints Following Benchmark for LLMs, featuring 1,000 curated samples that cover more than 200 real-life scenarios and over 50 NLP tasks. CFBench meticulously compiles constraints from real-world instructions and constructs an innovative systematic framework for constraint types, which includes 10 primary categories and over 25 subcategories, and ensures each constraint is seamlessly integrated within the instructions. To make certain that the evaluation of LLM outputs aligns with user perceptions, we propose an advanced methodology that integrates multi-dimensional assessment criteria with requirement prioritization, covering various perspectives of constraints, instructions, and requirement fulfillment. Evaluating current leading LLMs on CFBench reveals substantial room for improvement in constraints following, and we further investigate influencing factors and enhancement strategies. The data and code are publicly available at https://github.com/PKU-Baichuan-MLSystemLab/CFBench
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' ability to follow diverse real-world constraints
Addresses gaps in existing constraint-focused benchmarks for LLMs
Proposes multidimensional assessment aligned with user perceptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale benchmark with 1,000 diverse samples (sample shape sketched below)
Systematic framework with 10 primary constraint categories and 25+ subcategories
Multi-dimensional assessment with requirement-prioritization criteria
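To make the benchmark structure concrete, the sketch below shows a hypothetical shape for one CFBench sample, inferred from the abstract (a real-life scenario, an NLP task, an instruction with embedded constraints, and taxonomy labels). Field names and category labels are illustrative assumptions, not the released schema.

```python
# Hypothetical shape of one CFBench sample, inferred from the abstract.
# Field names and category/subcategory labels are illustrative assumptions.
sample = {
    "scenario": "e-commerce customer support",  # one of 200+ real-life scenarios
    "task": "text generation",                  # one of 50+ NLP tasks
    "instruction": (
        "Reply to the customer in under 80 words, keep a polite tone, "
        "and end with a numbered list of two next steps."
    ),
    "constraints": [
        {"category": "numerical", "subcategory": "word limit",
         "text": "under 80 words", "primary": True},
        {"category": "style", "subcategory": "tone",
         "text": "polite tone", "primary": False},
        {"category": "format", "subcategory": "numbered list",
         "text": "end with a numbered list of two next steps", "primary": True},
    ],
}
```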
👥 Authors
Tao Zhang, Baichuan Inc.
Yanjun Shen, Baichuan Inc.
Wenjing Luo, Baichuan Inc.
Yan Zhang, Baichuan Inc.
Hao Liang, Peking University
Fan Yang, Baichuan Inc.
Mingan Lin, Baichuan Inc.
Yujing Qiao, Baichuan Inc.
Weipeng Chen, Baichuan Inc.
Bin Cui, Peking University
Wentao Zhang, Peking University
Zenan Zhou, Baichuan Inc.