CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V

📅 2026-04-08
🤖 AI Summary
Existing benchmarks rarely satisfy generativity, competitiveness, and long-horizon strategic demands at once, and they lack the fine-grained evaluation signals needed to assess large language models' complex strategic decision-making. To address this gap, this work proposes CivBench, a benchmark grounded in multiplayer games of Civilization V that establishes a process-oriented paradigm for evaluating strategic competence: rather than relying on sparse win/loss outcomes, it predicts win-probability trajectories from turn-level game states, an approach validated through predictive, construct, and convergent validity analyses. Evaluated across 307 games, CivBench effectively differentiates the strategic capabilities of seven large language models, uncovering nuanced behavioral patterns that final game outcomes alone fail to capture.
📝 Abstract
Evaluating strategic decision-making in LLM-based agents requires generative, competitive, and longitudinal environments, yet few benchmarks provide all three, and fewer still offer evaluation signals rich enough for long-horizon, multi-agent play. We introduce CivBench, a benchmark for LLM strategists (i.e., agentic setups) in multiplayer Civilization V. Because terminal win/loss is too sparse a signal in games spanning hundreds of turns and multiple opponents, CivBench trains models on turn-level game state to estimate victory probabilities throughout play, validated through predictive, construct, and convergent validity. Across 307 games with 7 LLMs and multiple CivBench agent conditions, we demonstrate CivBench's potential to estimate strategic capabilities as an unsaturated benchmark, reveal model-specific effects of agentic setup, and outline distinct strategic profiles not visible through outcome-only evaluation.
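The core idea of the abstract, scoring play by per-turn victory probabilities rather than terminal win/loss, can be illustrated with a minimal sketch. This is not the paper's actual model: the feature names, the synthetic data, and the plain logistic regression trained by gradient ascent are all hypothetical stand-ins for whatever turn-level state encoding and estimator CivBench actually uses.

```python
# Hedged sketch (not the paper's method): per-turn win-probability
# estimation with a tiny logistic model over hypothetical game-state
# features (science, military, cities). Labels mark eventual winners.
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic training rows: eventual winners tend to have higher feature
# values on average. Real CivBench features and labels would differ.
def make_row(won):
    base = 2.0 if won else 1.0
    return ([random.gauss(base, 1.0) for _ in range(3)], won)

data = [make_row(i % 2) for i in range(400)]

# Train logistic regression by gradient ascent on the log-likelihood:
# w <- w + lr * (y - p) * x, with the bias folded in as a constant feature.
w = [0.0] * 4
lr = 0.1
for _ in range(200):
    for x, y in data:
        xb = x + [1.0]
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, xb)))
        for j in range(4):
            w[j] += lr * (y - p) * xb[j]

def win_prob(state):
    """Estimated win probability for one turn-level state vector."""
    xb = list(state) + [1.0]
    return sigmoid(sum(wi * xi for wi, xi in zip(w, xb)))

# A win-probability trajectory is just win_prob applied turn by turn,
# here over three mock turn states for one improving player.
trajectory = [win_prob(s) for s in [(0.5, 0.5, 0.5),
                                    (1.5, 1.5, 1.5),
                                    (2.5, 2.5, 2.5)]]
```

The resulting trajectory is the dense, turn-level signal the abstract contrasts with sparse terminal outcomes: two agents with the same final result can still show very different probability curves over hundreds of turns.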
Problem

Research questions and friction points this paper is trying to address.

strategic decision-making
LLM agents
benchmark evaluation
long-horizon multi-agent
Civilization V
Innovation

Methods, ideas, or system contributions that make the work stand out.

strategic decision-making
long-horizon evaluation
multi-agent benchmark
victory probability estimation
LLM agents
John Chen
Associate Professor of Entrepreneurship and Strategy, Baylor University
Strategic management, learning under uncertainty, competitive advantage, entrepreneurship, real
Sihan Cheng
Northwestern University
Can Gurkan
Northwestern University
Mingyi Lin
University of Arizona