🤖 AI Summary
Existing benchmarks rarely combine generative, competitive, and long-horizon strategic demands, and they lack fine-grained evaluation signals for assessing large language models' complex strategic decision-making. To address this gap, this work proposes CivBench, a benchmark grounded in multiplayer Civilization V that establishes a process-oriented paradigm for evaluating strategic competence: predicting win-probability trajectories from turn-level game states. Rather than relying on sparse win/loss outcomes, CivBench leverages dynamic win-probability estimates, validated through predictive, construct, and convergent validity analyses. Evaluated across 307 games, CivBench effectively differentiates the strategic capabilities of seven large language models, uncovering nuanced behavioral patterns that final game outcomes alone fail to capture.
📝 Abstract
Evaluating strategic decision-making in LLM-based agents requires generative, competitive, and longitudinal environments, yet few benchmarks provide all three, and fewer still offer evaluation signals rich enough for long-horizon, multi-agent play. We introduce CivBench, a benchmark for LLM strategists (i.e., agentic setups) in multiplayer Civilization V. Because terminal win/loss is too sparse a signal in games spanning hundreds of turns and multiple opponents, CivBench trains models on turn-level game states to estimate victory probabilities throughout play, validated through predictive, construct, and convergent validity analyses. Across 307 games with seven LLMs and multiple CivBench agent conditions, we demonstrate CivBench's potential to estimate strategic capabilities as an unsaturated benchmark, reveal model-specific effects of agentic setup, and outline distinct strategic profiles not visible through outcome-only evaluation.
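To make the core mechanism concrete, here is a minimal sketch of turn-level win-probability estimation: a classifier is trained on per-turn state snapshots labeled with each player's terminal outcome, then used to score one player's state at successive turns. The feature set (cities, techs, score), the gradient-boosted model, and all values are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of turn-level win-probability estimation,
# assuming hand-picked features and a gradient-boosted classifier;
# CivBench's actual features and model may differ.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Training data: one row per (game, turn, player) snapshot,
# labeled with that player's terminal win/loss outcome.
X_train = np.array([
    [2,  4, 110],   # [num_cities, techs_researched, game_score]
    [5, 12, 430],
    [3,  7, 220],
    [8, 20, 910],
])
y_train = np.array([0, 1, 0, 1])  # did this player ultimately win?

model = GradientBoostingClassifier(n_estimators=50).fit(X_train, y_train)

# Inference: score one player's state at successive turns to obtain
# a win-probability trajectory, a dense per-turn evaluation signal.
turn_states = np.array([
    [1,  1,  40],   # early game
    [3,  6, 180],   # mid game
    [7, 18, 820],   # late game
])
trajectory = model.predict_proba(turn_states)[:, 1]
print(trajectory)  # e.g. rising P(win) as the position strengthens
```

The design point is that every turn yields a graded signal, so an agent's strategic trajectory can be assessed throughout play rather than only from the game's final win/loss outcome.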