LR²Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems

📅 2025-02-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large language models (LLMs) lack rigorous evaluation of long-chain reflective reasoning (multi-step hypothesis generation, backtracking correction, and self-refinement), particularly under complex constraints. Method: We introduce LR²Bench, a dedicated benchmark for this capability comprising 850 samples across six Constraint Satisfaction Problem (CSP) task types spanning knowledge-based, logical, and spatial constraints, all of which require iterative validation and adaptive adjustment to derive a solution. Evaluation uses a strict Exact Match protocol applicable to both conventional and reasoning-specific models. Results: Experiments reveal severe limitations: the state-of-the-art reasoning models DeepSeek-R1 and OpenAI o1-preview achieve average Exact Match scores of only 20.0% and 23.6%, respectively, exposing fundamental bottlenecks in complex reflective reasoning. LR²Bench provides a reproducible, extensible evaluation infrastructure to advance research in this direction.

📝 Abstract
Recent progress in o1-like models has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR²Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR²Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. We conduct extensive evaluation on both conventional models and o1-like models. Our experimental results reveal that even the most advanced reasoning-specific models, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR²Bench, achieving an average Exact Match score of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs. The leaderboard of our benchmark is available at https://huggingface.co/spaces/UltraRonin/LR2Bench
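The abstract reports results as an average Exact Match score over benchmark samples. As a rough illustration of what such a metric computes, here is a minimal sketch in Python; the normalization step and function names are assumptions for illustration, not the benchmark's actual implementation:

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences do not affect the match (an assumed normalization)."""
    return " ".join(answer.lower().split())

def exact_match_score(predictions: list[str], golds: list[str]) -> float:
    """Average Exact Match over a set of samples, as a percentage.
    A prediction scores only if it matches the gold solution exactly
    after normalization."""
    assert len(predictions) == len(golds) and golds
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)

# Example with four hypothetical CSP solutions, one of which matches
# exactly after normalization -> 25.0
print(exact_match_score(
    ["A B C", "a  b c", "grid-1", "row swap"],
    ["A B D", "A B C", "grid-2", "col swap"],
))
```

Under this strict all-or-nothing criterion, a solution that violates even one constraint scores zero, which is consistent with the low averages the paper reports for long-chain CSP tasks.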
Problem

Research questions and friction points this paper is trying to address.

Evaluate reflective reasoning in LLMs
Introduce LR2Bench for CSP evaluation
Highlight gaps in current LLM capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces LR2Bench benchmark
Evaluates reflective reasoning in LLMs
Uses Constraint Satisfaction Problems
Jianghao Chen
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing; Large Language Models
Zhenlin Wei
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Zhenjiang Ren
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Ziyong Li
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Jiajun Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing; Large Language Models; Multimodal Information Processing