LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes LudoBench, the first framework to formalize the board game Ludo as a structured benchmark for evaluating large language models’ (LLMs’) strategic decision-making in environments that combine stochasticity with multi-agent interaction. The benchmark comprises 480 handcrafted scenarios spanning 12 distinct decision behaviors and integrates an Expectiminimax-based game-theoretically optimal agent, heuristic agents, and LLM-driven agents within a unified four-player simulation environment. Experiments show that six prominent LLMs align with the optimal strategy in only 40–46% of scenarios, exposing systematic biases, such as preferences for “completion-oriented” over “development-oriented” play, and marked sensitivity to prompt variations. These findings demonstrate that LudoBench is an effective and interpretable probe of LLM strategic reasoning under uncertainty and competition.
📝 Abstract
We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning complexity. LudoBench comprises 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, each isolating a specific strategic choice. We additionally contribute a fully functional four-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents. The game-theory agent uses Expectiminimax search with depth-limited lookahead to provide a principled strategic ceiling beyond greedy heuristics. Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40–46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game-theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt sensitivity as a key vulnerability. LudoBench provides a lightweight and interpretable framework for benchmarking LLM strategic reasoning under uncertainty. All code, the spot dataset (480 entries), and model outputs are available at https://anonymous.4open.science/r/LudoBench-5CBF/
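The depth-limited Expectiminimax the abstract describes interleaves chance nodes (averaging over equiprobable die rolls) with max nodes (our moves) and min nodes (the opponent's). A minimal sketch of that structure on a hypothetical two-player race game; `expectiminimax`, `ToyRace`, and `best_move` are illustrative names, not the paper's released code:

```python
def expectiminimax(state, depth, game, maximizing):
    """Depth-limited Expectiminimax: each ply is a chance node that
    averages over die rolls, inside which the player to move
    maximizes (us) or minimizes (opponent) over legal moves."""
    if depth == 0 or game.is_terminal(state):
        return game.evaluate(state)
    expected = 0.0
    for roll in game.die_faces:  # chance node: equiprobable rolls
        values = [
            expectiminimax(game.apply(state, move, roll, maximizing),
                           depth - 1, game, not maximizing)
            for move in game.legal_moves(state)
        ]
        expected += max(values) if maximizing else min(values)
    return expected / len(game.die_faces)


class ToyRace:
    """Hypothetical two-player race with a 1-2 die; a stand-in for
    Ludo used only to exercise the search."""
    GOAL = 6
    die_faces = (1, 2)

    def is_terminal(self, state):
        return max(state) >= self.GOAL

    def evaluate(self, state):
        me, opp = state
        return me - opp  # lead of the maximizing player

    def legal_moves(self, state):
        return ("advance", "stay")

    def apply(self, state, move, roll, maximizing):
        me, opp = state
        step = roll if move == "advance" else 0
        return (me + step, opp) if maximizing else (me, opp + step)


def best_move(state, depth, game):
    """Pick the maximizer's move with the highest expected value."""
    def value(move):
        return sum(expectiminimax(game.apply(state, move, roll, True),
                                  depth - 1, game, False)
                   for roll in game.die_faces) / len(game.die_faces)
    return max(game.legal_moves(state), key=value)
```

In Ludo itself the branching factor at chance nodes is the six die faces and move choices are which piece to advance, which is why the paper limits lookahead depth rather than searching to terminal states.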
Problem

Research questions and friction points this paper is trying to address.

strategic reasoning
behavioral decision-making
multi-agent board game
LLM evaluation
uncertainty
Innovation

Methods, ideas, or system contributions that make the work stand out.

LudoBench
strategic reasoning
Expectiminimax
behavioral decision-making
prompt sensitivity