Measuring General Intelligence with Generated Games

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess the general reasoning capabilities of large language models (LLMs), particularly in open-ended, dynamic decision-making settings. Method: The paper introduces gg-bench, an automated benchmark built on previously unseen games that can be generated in unlimited supply. State-of-the-art LLMs (e.g., GPT-4o, Claude 3.7 Sonnet, o1) design novel game rules and implement them as Gym-compatible environments; adversarial opponents are then trained on each game via self-play reinforcement learning; evaluated models play against these opponents in real time via in-context learning, receiving the game description, current state, and valid moves as a prompt. Contribution/Results: The key innovation is using the LLM as game designer, environment implementer, and evaluation collaborator in a single closed-loop pipeline, which lets the benchmark grow autonomously. Experiments show that state-of-the-art general-purpose models achieve only 7-9% win rates on gg-bench, while reasoning models reach 31-36%. All games, the data generation pipeline, and the evaluation code are publicly released.

📝 Abstract
We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data generating process where new evaluation instances can be generated at will. In particular, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their winrate against these RL agents by prompting models with the game description, current board state, and a list of valid moves, after which models output the moves they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve winrates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini and DeepSeek-R1 achieve average winrates of 31-36%. We release the generated games, data generation process, and evaluation code in order to support future modeling work and expansion of our benchmark.
Problem

Research questions and friction points this paper is trying to address.

Evaluating general reasoning in language models using generated games
Creating dynamic benchmarks via LLM-generated game environments
Assessing model performance against RL agents in gameplay
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM generates novel game descriptions
LLM implements games as Gym environments
RL agents trained via self-play
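The evaluation loop the abstract describes (prompt the model with the game description, current state, and valid moves; play it against a self-play-trained RL opponent; score by win rate) can be sketched with a toy stand-in game. Everything below is illustrative: `NimEnv`, `build_prompt`, and the two policies are hypothetical stand-ins, not gg-bench's actual code or one of its generated games.

```python
class NimEnv:
    """Toy Gym-style game standing in for one of gg-bench's generated
    environments (hypothetical: the real environments are LLM-written)."""

    def __init__(self, tokens=10):
        self.initial = tokens
        self.tokens = tokens

    def reset(self):
        self.tokens = self.initial
        return self.tokens

    def valid_moves(self):
        # Take 1-3 tokens, but never more than remain.
        return [m for m in (1, 2, 3) if m <= self.tokens]

    def step(self, move):
        self.tokens -= move
        done = self.tokens == 0  # taking the last token ends the game
        return self.tokens, done


def build_prompt(description, state, valid_moves):
    """Assemble the evaluation input the paper describes: game
    description, current board state, and the list of valid moves."""
    return (f"{description}\n"
            f"Current state: {state} tokens remain.\n"
            f"Valid moves: {valid_moves}\n"
            f"Reply with exactly one move.")


def play_episode(env, model_move, opponent_move):
    """One game: player 0 is the prompted language model, player 1 the
    self-play RL opponent. Returns the index of the winner."""
    state = env.reset()
    player = 0
    while True:
        moves = env.valid_moves()
        if player == 0:
            prompt = build_prompt(
                "Players alternate taking 1-3 tokens; "
                "whoever takes the last token wins.", state, moves)
            move = model_move(prompt, moves)  # an LLM call in the real setup
        else:
            move = opponent_move(state, moves)
        state, done = env.step(move)
        if done:
            return player  # the player who just moved took the last token
        player = 1 - player
```

With cheap stand-in policies, e.g. `play_episode(NimEnv(), lambda p, m: min(m), lambda s, m: min(m))`, the loop plays a full game; averaging the winner over many episodes gives the win rate the paper reports.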