🤖 AI Summary
This study investigates whether large language models (LLMs) can autonomously generate executable board game code from natural-language rule descriptions, aiming to establish an efficient, reusable LLM-assisted tabletop game development framework.
Method: We propose Boardwalk, a General Game Playing API, and anonymize game elements to mitigate bias from pretraining knowledge. Using Claude, DeepSeek, and ChatGPT, we evaluate code generation for 12 board games under two settings: unconstrained free-form generation and generation constrained to the Boardwalk API.
Contribution/Results: The best performing model, Claude 3.7 Sonnet, produces 55.6% of games without any errors. Constraining generation to the API increases error frequency, but the severity of errors depends more on the LLM than on the setting. This work provides a systematic empirical analysis of LLMs' capabilities and failure modes in rule-to-code translation, quantifying the impact of API design on generation fidelity, and offers evidence and methodological guidance toward automated board game development.
📝 Abstract
Implementing board games in code can be a time-consuming task. However, Large Language Models (LLMs) have proven effective at generating code for domain-specific tasks given simple contextual information. We investigate whether LLMs can implement digital versions of board games from rules described in natural language. This would be a step towards an LLM-assisted framework for quick board game code generation. We aim to identify the main challenges LLMs face in implementing board games, and how different approaches and models compare to one another. We task three state-of-the-art LLMs (Claude, DeepSeek, and ChatGPT) with coding a selection of 12 popular and obscure games in free-form and within Boardwalk, our proposed General Game Playing API. We anonymize the games and components to avoid evoking pre-trained LLM knowledge. The implementations are tested for playability and rule compliance. We evaluate success rate and common errors across LLMs and game popularity. Our approach proves viable, with the best performing model, Claude 3.7 Sonnet, yielding 55.6% of games without any errors. While compliance with the API increases error frequency, the severity of errors is more significantly dependent on the LLM. We outline future steps for creating a framework to integrate this process, making the elaboration of board games more accessible.
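To illustrate the anonymization idea mentioned above, the sketch below replaces recognizable game terms in a rule description with neutral placeholders before the text is shown to an LLM. The substitution map and function name here are illustrative assumptions, not the study's actual procedure.

```python
import re

# Hypothetical mapping from recognizable game terms to neutral placeholders.
# The real study's anonymization scheme is not reproduced here.
ANONYMIZATION_MAP = {
    "chess": "Game A",
    "pawn": "piece type 1",
    "knight": "piece type 2",
    "checkmate": "terminal condition X",
}

def anonymize_rules(text: str) -> str:
    """Replace whole-word occurrences of known game terms, case-insensitively,
    so the rule text no longer evokes the LLM's pretrained knowledge of the game."""
    for term, placeholder in ANONYMIZATION_MAP.items():
        text = re.sub(rf"\b{re.escape(term)}\b", placeholder, text, flags=re.IGNORECASE)
    return text

rules = "In chess, a pawn may be promoted; the game ends at checkmate."
print(anonymize_rules(rules))
# → In Game A, a piece type 1 may be promoted; the game ends at terminal condition X.
```

Whole-word matching avoids corrupting substrings (e.g. "pawn" inside "spawns"); a production version would also need to anonymize board layouts and component names consistently across the whole rule set.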