🤖 AI Summary
Existing evaluation methods struggle to measure large language models’ (LLMs’) social and strategic reasoning in long-term multi-agent interactions. To address this gap, this work proposes Mindgames, a multi-game arena built on the TextArena platform that operationalizes theory of mind into four measurable dimensions: belief inference, opponent modeling, cooperative reasoning, and sustained deception. The framework introduces a dynamic competitive evaluation paradigm grounded in real interaction trajectories, integrating TrueSkill ratings, turn-level logs, and the MG-Ref offline tournament protocol. Analysis of 29,571 games shows that brittle rule adherence is a primary bottleneck for current LLMs and that leaderboard validity varies substantially across games. In Secret Mafia in particular, rankings show a pronounced error-survival confound: they can reward robustness to opponent errors as much as strategic ability.
📝 Abstract
Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.