🤖 AI Summary
This work addresses the lack of reliable evaluation for large language models' (LLMs) multi-step reasoning over long contexts. We introduce DocPuzzle, a benchmark built from real-world long documents, comprising 100 expert-crafted question-answer pairs. To overcome the biases of answer-only evaluation, we propose a process-aware assessment paradigm: a human-AI collaborative annotation pipeline combined with a checklist-guided process analysis framework that structurally quantifies the quality of a model's reasoning path. This approach mitigates answer-guessing artifacts and establishes a new standard for evaluating long-context reasoning capability. Experimental results show that slow-thinking reasoning models such as o1-preview (69.7%) and DeepSeek-R1 (66.3%) significantly outperform the best general instruct models such as Claude 3.5 Sonnet (57.7%). Notably, distilled variants such as DeepSeek-R1-Distill-Qwen-32B (41.3%) exhibit substantial performance degradation, indicating that sophisticated multi-step reasoning over long contexts does not transfer effectively through knowledge distillation alone. Our benchmark and methodology provide a rigorous, process-centric foundation for advancing robust long-context reasoning evaluation.
📝 Abstract
We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs). The benchmark comprises 100 expert-level QA problems requiring multi-step reasoning over long real-world documents. To ensure task quality and complexity, we implement a human-AI collaborative annotation-validation pipeline. DocPuzzle introduces an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis, establishing new standards for assessing reasoning capacity in LLMs. Our evaluation results show that: 1) advanced slow-thinking reasoning models such as o1-preview (69.7%) and DeepSeek-R1 (66.3%) significantly outperform the best general instruct models such as Claude 3.5 Sonnet (57.7%); 2) distilled reasoning models such as DeepSeek-R1-Distill-Qwen-32B (41.3%) fall far behind their teacher models, suggesting that the generalization of reasoning capabilities is difficult to maintain through distillation alone.
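To make the checklist-guided process analysis concrete, here is a minimal sketch of how such scoring could work. This is an illustrative assumption, not the paper's actual pipeline: the class names, the judge-filled `satisfied` flags, and the full-coverage threshold are all hypothetical. The key idea it demonstrates is that a correct final answer earns credit only when the model's reasoning trace also covers every required step, which is what filters out lucky guesses.

```python
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    """One required reasoning step for a question (hypothetical schema)."""
    description: str          # e.g. "locates the relevant figure in Section 3"
    satisfied: bool = False   # marked by a human or LLM judge

@dataclass
class ProcessJudgment:
    """Combines answer correctness with process-level checklist coverage."""
    answer_correct: bool
    checklist: list = field(default_factory=list)

    def coverage(self) -> float:
        """Fraction of required reasoning steps the trace exhibits."""
        if not self.checklist:
            return 0.0
        return sum(item.satisfied for item in self.checklist) / len(self.checklist)

    def credited(self, threshold: float = 1.0) -> bool:
        """Credit the answer only if the reasoning process also checks out,
        mitigating guessing bias from a lucky correct final answer."""
        return self.answer_correct and self.coverage() >= threshold

# Example: a correct answer with an incomplete reasoning trace gets no credit.
judgment = ProcessJudgment(
    answer_correct=True,
    checklist=[
        ChecklistItem("identifies the relevant document section", satisfied=True),
        ChecklistItem("performs the required intermediate calculation", satisfied=False),
    ],
)
print(judgment.coverage())   # 0.5
print(judgment.credited())   # False
```

Under this sketch, aggregate benchmark accuracy would be the mean of `credited()` over all 100 questions, so a model that guesses final answers without sound intermediate reasoning scores poorly even when its answers match.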