DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of reliable evaluation for large language models' (LLMs) multi-step reasoning over long contexts. It introduces DocPuzzle, a benchmark built from real-world long documents and comprising 100 expert-crafted question-answer pairs. To overcome the biases of answer-only evaluation, the authors propose a process-aware assessment paradigm: human-AI collaborative annotation combined with a checklist-guided process analysis framework that structurally quantifies the quality of a model's reasoning path, mitigating answer-guessing artifacts. Experimental results show that slow-thinking reasoning models such as o1-preview (69.7%) and DeepSeek-R1 (66.3%) significantly outperform the best general instruct models, e.g., Claude 3.5 Sonnet (57.7%). Notably, distilled variants such as DeepSeek-R1-Distill-Qwen-32B (41.3%) degrade substantially, suggesting that sophisticated multi-step reasoning over long contexts does not transfer effectively via knowledge distillation. The benchmark and methodology provide a rigorous, process-centric foundation for evaluating long-context reasoning.

📝 Abstract
We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs). The benchmark comprises 100 expert-level QA problems requiring multi-step reasoning over long real-world documents. To ensure task quality and complexity, we implement a human-AI collaborative annotation-validation pipeline. DocPuzzle introduces an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis, establishing new standards for assessing reasoning capacities in LLMs. Our evaluation results show that: 1) advanced slow-thinking reasoning models such as o1-preview (69.7%) and DeepSeek-R1 (66.3%) significantly outperform the best general instruct models such as Claude 3.5 Sonnet (57.7%); 2) distilled reasoning models such as DeepSeek-R1-Distill-Qwen-32B (41.3%) fall far behind their teacher model, suggesting that it is difficult to preserve generalizable reasoning capabilities through distillation alone.
Problem

Research questions and friction points this paper is trying to address.

How to reliably evaluate multi-step reasoning over long contexts in LLMs.
How to construct QA problems that genuinely require multi-step reasoning over long real-world documents.
How to mitigate answer-guessing bias by assessing the reasoning process, not just the final answer.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-AI collaborative annotation-validation pipeline
Checklist-guided process analysis mitigates guessing bias
Benchmark assesses multi-step reasoning in long documents
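To make the checklist-guided idea concrete, here is a minimal sketch of process-aware scoring. The checklist items, the substring matching, and the function name are illustrative assumptions, not the paper's actual implementation (which uses human-AI collaborative annotation and would plausibly employ an LLM judge per checklist item rather than string matching):

```python
def checklist_score(reasoning_trace: str, checklist: list[str]) -> float:
    """Fraction of required reasoning steps evidenced in a model's trace.

    `checklist` holds key phrases that a correct multi-step reasoning
    path should cover. A real evaluator would judge each item with a
    stronger method (e.g., an LLM grader) instead of substring matching.
    """
    if not checklist:
        return 0.0
    trace = reasoning_trace.lower()
    hits = sum(1 for item in checklist if item.lower() in trace)
    return hits / len(checklist)


# Hypothetical example: a trace covering 2 of 3 required steps.
trace = "First locate the 2019 revenue figure, then compare it to 2020."
steps = ["2019 revenue", "compare", "compute the growth rate"]
score = checklist_score(trace, steps)  # 2/3
```

The point of scoring the path rather than the answer is that a model which guesses the correct final answer without performing the intermediate steps receives a low process score, which is exactly the guessing bias the benchmark aims to mitigate.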