GameArena: Evaluating LLM Reasoning through Live Computer Games

📅 2024-12-09
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing methods for evaluating LLM reasoning rely on static datasets, which are prone to contamination and saturation, or on coarse binary human feedback detached from authentic interactive settings; neither allows specific reasoning skills (e.g., deduction, induction) to be disentangled. To address this, we propose GameArena, an *in-the-wild*, gamified dynamic benchmark built on three human-AI interaction games. By retrospectively analyzing gameplay logs, the framework enables process-oriented, fine-grained assessment of reasoning and introduces a stepwise reasoning data collection paradigm that overcomes the limitations of binary outcome feedback. From more than 2,000 real-world game sessions, we derive interpretable, skill-level reasoning profiles for five state-of-the-art LLMs, and a user study with 100 participants shows higher engagement than Chatbot Arena, supporting the benchmark's ecological validity and practical utility.

📝 Abstract
Evaluating the reasoning abilities of large language models (LLMs) is challenging. Existing benchmarks often depend on static datasets, which are vulnerable to data contamination and may become saturated over time, or on binary live human feedback that conflates reasoning with other abilities. As the most prominent dynamic benchmark, Chatbot Arena evaluates open-ended questions in real-world settings, but lacks the granularity to assess specific reasoning capabilities. We introduce GameArena, a dynamic benchmark designed to evaluate LLM reasoning capabilities through interactive gameplay with humans. GameArena consists of three games designed to test specific reasoning capabilities (e.g., deductive and inductive reasoning) while keeping participants entertained and engaged. We analyze the gaming data retrospectively to uncover the underlying reasoning processes of LLMs and measure their fine-grained reasoning capabilities. We collect over 2,000 game sessions and provide detailed assessments of various reasoning capabilities for five state-of-the-art LLMs. Our user study with 100 participants suggests that GameArena improves user engagement compared to Chatbot Arena. For the first time, GameArena enables the collection of step-by-step LLM reasoning data in the wild.
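To make the abstract's contrast between binary outcome feedback and stepwise, process-level evaluation concrete, here is a minimal sketch of how recorded game sessions might be scored retrospectively. Every name here (`Turn`, `GameSession`, `stepwise_score`, `reasoning_profile`) and the scoring rule itself are illustrative assumptions, not the paper's actual pipeline.

```python
from dataclasses import dataclass, field

# Hypothetical session record: each turn stores the model's intermediate
# reasoning step and whether that step was logically valid given the game
# state so far. These structures are assumptions for illustration only.
@dataclass
class Turn:
    reasoning_step: str
    is_valid: bool  # e.g., a deduction consistent with all prior clues

@dataclass
class GameSession:
    model: str
    skill: str                        # e.g., "deductive" or "inductive"
    turns: list = field(default_factory=list)
    won: bool = False                 # the binary outcome signal

def stepwise_score(session: GameSession) -> float:
    """Fraction of valid intermediate reasoning steps (0.0-1.0).

    Unlike the binary `won` flag, this credits sound reasoning even in
    lost games and penalizes lucky wins built on invalid steps.
    """
    if not session.turns:
        return 0.0
    return sum(t.is_valid for t in session.turns) / len(session.turns)

def reasoning_profile(sessions: list) -> dict:
    """Aggregate per-model, per-skill average stepwise scores."""
    buckets: dict = {}
    for s in sessions:
        buckets.setdefault((s.model, s.skill), []).append(stepwise_score(s))
    return {key: sum(v) / len(v) for key, v in buckets.items()}
```

The point of the sketch is the contrast: `won` captures the single bit of feedback that outcome-based live evaluation provides, while `stepwise_score` captures the kind of process-level, skill-specific signal the paper aims to collect from gameplay.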
Problem

Research questions and friction points this paper is trying to address.

Evaluation Methods
Large Language Models
Cognitive Abilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

GameArena
LanguageModelTesting
InteractiveEvaluation