LLMsPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluations rely heavily on static, single-task metrics, failing to capture dynamic decision-making and social behavior in strategic interactions. Method: We introduce LLMsPark, a comprehensive benchmark for strategic intelligence, grounded in game-theoretic multi-agent environments spanning canonical cooperative and competitive scenarios. It features dynamic interaction protocols and a multidimensional scoring framework assessing strategy consistency, reasoning depth, social preferences, and more. We conduct cross-evaluation across 15 state-of-the-art LLMs. Contribution/Results: Advanced models (e.g., GPT-4) demonstrate significantly superior strategic stability and cross-context reasoning. All evaluation data, protocols, and leaderboards are publicly released. LLMsPark overcomes fundamental limitations of traditional benchmarks by enabling reproducible, scalable, and quantifiable assessment of strategic intelligence in LLMs.

📝 Abstract
As large language models (LLMs) advance across diverse tasks, the need for comprehensive evaluation beyond single metrics becomes increasingly important. To fully assess LLM intelligence, it is crucial to examine their interactive dynamics and strategic behaviors. We present LLMsPark, a game theory-based evaluation platform that measures LLMs' decision-making strategies and social behaviors in classic game-theoretic settings, providing a multi-agent environment to explore strategic depth. Our system cross-evaluates 15 leading LLMs (both commercial and open-source) using leaderboard rankings and scoring mechanisms. Higher scores reflect stronger reasoning and strategic capabilities, revealing distinct behavioral patterns and performance differences across models. This work introduces a novel perspective for evaluating LLMs' strategic intelligence, enriching existing benchmarks and broadening their assessment in interactive, game-theoretic scenarios. The benchmark and rankings are publicly available at https://llmsparks.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' strategic decision-making in game theory contexts
Assessing interactive dynamics and social behaviors of language models
Measuring reasoning capabilities through multi-agent gaming scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Game theory-based platform for LLM evaluation
Multi-agent environment testing strategic behaviors
Cross-evaluation of 15 LLMs via leaderboard
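The paper's own evaluation code is not shown on this page; as a rough illustration of what round-robin cross-evaluation with leaderboard scoring in a classic game-theoretic setting can look like, here is a minimal sketch using an iterated prisoner's dilemma with simple hand-written agents (all names and payoffs are illustrative, not the authors' implementation):

```python
from itertools import combinations

# Standard prisoner's dilemma payoffs: (my_move, opponent_move) -> my score.
# "C" = cooperate, "D" = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def always_cooperate(opponent_history):
    return "C"

def always_defect(opponent_history):
    return "D"

def tit_for_tat(opponent_history):
    # Copy the opponent's previous move; cooperate on the first round.
    return opponent_history[-1] if opponent_history else "C"

def play_match(agent_a, agent_b, rounds=10):
    """Play an iterated prisoner's dilemma; return total scores (a, b)."""
    seen_by_a, seen_by_b = [], []  # moves each agent has observed
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = agent_a(seen_by_a), agent_b(seen_by_b)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        seen_by_a.append(move_b)
        seen_by_b.append(move_a)
    return score_a, score_b

def leaderboard(agents, rounds=10):
    """Round-robin cross-evaluation: every agent plays every other agent."""
    totals = {name: 0 for name in agents}
    for (name_a, fn_a), (name_b, fn_b) in combinations(agents.items(), 2):
        score_a, score_b = play_match(fn_a, fn_b, rounds)
        totals[name_a] += score_a
        totals[name_b] += score_b
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

agents = {
    "tit_for_tat": tit_for_tat,
    "always_cooperate": always_cooperate,
    "always_defect": always_defect,
}
print(leaderboard(agents))
```

In an LLM benchmark, each agent function would instead prompt a model with the game state and parse its move; the round-robin structure and cumulative scoring that produce the leaderboard stay the same.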
Junhao Chen
Shenzhen International Graduate School, Tsinghua University
Jingbo Sun
Institute of Computing Technology, Chinese Academy of Sciences
Xiang Li
School of Software and Microelectronics, Peking University
Haidong Xin
Northeastern University; Harbin Engineering University
Yuhao Xue
Tongji University
Yibin Xu
Tongji University
Hao Zhao
AIR, Tsinghua University; BAAI