🤖 AI Summary
Current LLM evaluation is vulnerable to training data contamination and answer leakage, leading to inflated estimates of reasoning capability. To address this, we propose the first out-of-distribution (OOD) evaluation framework explicitly designed for reasoning robustness. Our method introduces a dynamic, prompt-driven, multi-task OOD data generation mechanism, yielding a high-quality out-of-distribution benchmark of 2,912 samples; it further establishes a standardized reasoning trajectory evaluation protocol that enables type-agnostic, fair comparison across both reasoning and non-reasoning models. Empirical evaluation of 20 mainstream models demonstrates that the framework effectively uncovers performance overestimation and data leakage, substantially improving assessment reliability. Key contributions include (1) a novel dynamic OOD generation mechanism that ensures semantic diversity and distributional separation from pretraining and instruction-tuning corpora, and (2) a unified robustness evaluation paradigm grounded in trace-based fidelity and generalization metrics.
📝 Abstract
Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to robustly assess LLMs' reasoning capability. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset containing 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that most LLMs' performance is far from robust and that they exhibit a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench provides a reliable evaluation of LLMs and reduces the impact of data contamination.