Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multi-turn interactive reasoning in large language models has not been evaluated systematically, which makes it hard to assess how well they actively probe, integrate information, and respond appropriately in dynamic environments. This work proposes an evaluation framework grounded in executable game environments that models reasoning as active querying and belief updating. Beyond success rate and interaction efficiency, the framework measures robustness to contextual perturbations and two forms of metacognitive adaptation: counterfactual revision and necessity judgment. The benchmark comprises 474 executable games, each evaluated at five difficulty levels for fine-grained assessment. Experiments show that the benchmark clearly separates frontier large language models on both success rate and interaction efficiency. Contextual perturbations cause moderate but consistent declines, whereas the metacognitive tasks cause much larger drops, revealing limitations in current models' higher-order reasoning.

📝 Abstract

We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating. In this framework, LLMs receive only the task rules and must issue targeted queries to a hidden environment, integrate partial observations over time, and decide when to submit a final answer. Beyond standard success rate and interaction efficiency, we evaluate contextual robustness under controlled perturbations, and metacognitive adaptation through counterfactual revision and necessity judgment. We instantiate the framework as a benchmark of 474 executable games, each evaluated under five fixed configuration search spaces corresponding to five difficulty levels, and evaluate a broad set of frontier LLMs. Results show that the benchmark is highly discriminative, exposing large differences not only in success rate but also in interaction efficiency. Moreover, we show empirically that contextual perturbations cause moderate but consistent declines, whereas counterfactual revision and necessity judgment lead to much larger drops.
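
The query-then-submit protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual interface: `HiddenNumberGame`, `query`, `submit`, and `bisection_agent` are hypothetical names, the number-guessing game is an invented toy example, and a simple bisection routine stands in for the LLM.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HiddenNumberGame:
    """Hypothetical executable game: the secret lies in [low, high]."""
    secret: int
    low: int
    high: int
    queries_used: int = 0

    def query(self, guess: int) -> str:
        """Return partial evidence about the secret and count the query."""
        self.queries_used += 1
        if guess < self.secret:
            return "higher"
        if guess > self.secret:
            return "lower"
        return "equal"

    def submit(self, answer: int) -> bool:
        """Score the final answer; in this sketch it ends the episode."""
        return answer == self.secret


def bisection_agent(game: HiddenNumberGame, max_queries: int) -> Tuple[bool, int]:
    """Stand-in for an LLM: query, update the belief interval, then submit."""
    # The agent starts from the rules only (the range), never the secret.
    low, high = game.low, game.high
    while low < high and game.queries_used < max_queries:
        mid = (low + high) // 2
        feedback = game.query(mid)
        if feedback == "equal":
            low = high = mid      # belief collapses to a single value
        elif feedback == "higher":
            low = mid + 1         # discard the lower half
        else:
            high = mid - 1        # discard the upper half
    # Decide to stop and commit: submit the smallest value still consistent.
    success = game.submit(low)
    return success, game.queries_used


if __name__ == "__main__":
    game = HiddenNumberGame(secret=37, low=1, high=100)
    success, queries = bisection_agent(game, max_queries=10)
    print(f"success={success}, queries={queries}")
```

The two returned values correspond loosely to the two headline metrics, success rate and interaction efficiency. In this toy setting, widening the range or lowering `max_queries` plays the role of a harder configuration; how the benchmark actually defines its five difficulty levels is not specified here.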

Problem

Research questions and friction points this paper is trying to address.

interactive reasoning

large language models

reasoning evaluation

contextual robustness

metacognitive adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive reasoning

hierarchical benchmark

executable games