🤖 AI Summary
This work studies cooperative reasoning under partial observability in large language models by benchmarking 17 state-of-the-art LLM agents on the cooperative card game Hanabi in 2-5 player games, across model scales from 4B to 600B+ parameters. It compares three context-engineering settings: Watson (a minimal prompt with only explicit card details), Sherlock (scaffolding with programmatic, Bayesian-motivated deductions), and Mycroft (multi-turn state tracking via working memory). The study finds that agents can maintain an internal working memory for state tracking and that cross-play performance between different LLMs interpolates smoothly with model strength. In the Sherlock setting, the strongest reasoning models average more than 15 points, still below experienced humans and specialist Hanabi agents, which consistently score above 20. The authors release the first public Hanabi datasets with annotated trajectories and move utilities: HanabiLogs (1,520 full game logs) and HanabiRewards (560 games with dense move-level value annotations). Supervised and RL fine-tuning of a 4B open-weight model (Qwen3-Instruct) improves Hanabi play by 21% and 156% respectively, bringing it within about 3 points of o4-mini and surpassing GPT-4.1 by 52%. The RL-fine-tuned model also generalizes beyond Hanabi: it improves a cooperative group-guessing benchmark by 11%, temporal reasoning on EventQA by 6.4%, and instruction following on IFBench-800K by 1.7 Pass@10, while matching AIME 2025 Pass@10.
📝 Abstract
Cooperative reasoning under incomplete information remains challenging for both humans and multi-agent systems. The card game Hanabi embodies this challenge, requiring theory-of-mind reasoning and strategic communication. We benchmark 17 state-of-the-art LLM agents in 2-5 player games and study the impact of context engineering across model scales (4B to 600B+) to understand persistent coordination failures and robustness to scaffolding: from a minimal prompt with only explicit card details (Watson setting), to scaffolding with programmatic, Bayesian-motivated deductions (Sherlock setting), to multi-turn state tracking via working memory (Mycroft setting). We show that (1) agents can maintain an internal working memory for state tracking and (2) cross-play performance between different LLMs smoothly interpolates with model strength. In the Sherlock setting, the strongest reasoning models exceed 15 points on average across player counts, yet still trail experienced humans and specialist Hanabi agents, both consistently scoring above 20. We release the first public Hanabi datasets with annotated trajectories and move utilities: (1) HanabiLogs, containing 1,520 full game logs for instruction tuning, and (2) HanabiRewards, containing 560 games with dense move-level value annotations for all candidate moves. Supervised and RL finetuning of a 4B open-weight model (Qwen3-Instruct) on our datasets improves cooperative Hanabi play by 21% and 156% respectively, bringing performance to within ~3 points of a strong proprietary reasoning model (o4-mini) and surpassing the best non-reasoning model (GPT-4.1) by 52%. The HanabiRewards RL-finetuned model further generalizes beyond Hanabi, improving performance on a cooperative group-guessing benchmark by 11%, temporal reasoning on EventQA by 6.4%, instruction-following on IFBench-800K by 1.7 Pass@10, and matching AIME 2025 mathematical reasoning Pass@10.
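To make the idea of "programmatic, Bayesian-motivated deductions" concrete, the following is a minimal sketch of one kind of deduction such scaffolding could supply: a posterior over a hidden card's identity given hint constraints and the cards a player can see. It is not the authors' implementation. The function names (`card_posterior`, `playable_probability`), the card encoding, and the uniform-over-unseen-copies assumption are all illustrative choices; only the standard Hanabi deck composition (five colors, with three 1s, two each of 2 to 4, and one 5 per color) is taken as given.

```python
from collections import Counter
from itertools import product

COLORS = ("R", "Y", "G", "W", "B")
# Copies of each rank per color in the standard 50-card Hanabi deck.
RANK_COPIES = {1: 3, 2: 2, 3: 2, 4: 2, 5: 1}


def card_posterior(visible, possible_colors, possible_ranks):
    """Posterior over one hidden card's identity.

    Hypothetical helper: assumes every unseen copy is equally likely,
    and treats the card in isolation (no joint inference over the
    rest of the player's own hand).
    """
    seen = Counter(visible)
    weights = {}
    for color, rank in product(possible_colors, possible_ranks):
        remaining = RANK_COPIES[rank] - seen[(color, rank)]
        if remaining > 0:
            weights[(color, rank)] = remaining
    total = sum(weights.values())
    if total == 0:
        raise ValueError("hints are inconsistent with the visible cards")
    return {card: w / total for card, w in weights.items()}


def playable_probability(posterior, fireworks):
    """Probability the card is the next needed rank of its color.

    `fireworks` maps color -> highest rank already played (0 if none).
    """
    return sum(
        p
        for (color, rank), p in posterior.items()
        if fireworks.get(color, 0) + 1 == rank
    )


# Usage: the card was hinted as a 1; two red 1s and one green 1 are
# visible elsewhere, and the red firework already stands at rank 1.
visible = [("R", 1), ("R", 1), ("G", 1)]
posterior = card_posterior(visible, COLORS, (1,))
p_play = playable_probability(posterior, {"R": 1})
print(round(p_play, 4))  # 11/12, since only the last red 1 is unplayable
```

A scaffold in this spirit would hand the model such probabilities as text rather than asking it to count cards itself, which is one plausible reading of how the Sherlock setting differs from the minimal Watson prompt.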