🤖 AI Summary
Current LLM evaluation is vulnerable to training data contamination and answer leakage, leading to inflated estimates of reasoning capability. To address this, we propose the first out-of-distribution (OOD) evaluation framework explicitly designed for reasoning robustness. Our method introduces a dynamic, prompt-driven, multi-task OOD data generation mechanism, yielding a high-quality out-of-distribution benchmark of 2,912 samples; it further establishes a standardized reasoning trajectory evaluation protocol that enables type-agnostic, fair comparison across both reasoning and non-reasoning models. Empirical evaluation of 20 mainstream models demonstrates that the framework effectively uncovers performance overestimation and data leakage, substantially improving assessment reliability. Key contributions include (1) a novel dynamic OOD generation mechanism that ensures semantic diversity and distributional separation from pretraining and instruction-tuning corpora, and (2) a unified robustness evaluation paradigm grounded in trace-based fidelity and generalization metrics.
📝 Abstract
Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to robustly assess LLMs' reasoning capability. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset containing 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that most LLMs' performance is far from robust and that they exhibit a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench provides a reliable evaluation of LLMs and reduces the impact of data contamination.