PokéChamp: an Expert-level Minimax Language Agent

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces PokéChamp, the first real-time, two-player zero-sum game agent that integrates large language models (LLMs) into minimax tree search for *Pokémon* battles, a partially observable, highly strategic domain. Methodologically, it relies on prompt engineering alone, with no LLM fine-tuning, to drive action pruning, opponent modeling, and in-context value estimation, tightly coupled with an enhanced local battle simulation engine. Key contributions include: (1) the largest dataset of authentic human player matches (over 3 million battles); (2) a skill-oriented, fine-grained evaluation benchmark; and (3) an open-source, improved battle engine. Experiments show the GPT-4o-powered agent achieves a 76% win rate against the strongest LLM baseline and 84% against top rule-based systems; even an 8-billion-parameter Llama 3.1 variant attains 64%. Projected Elo ratings of 1300-1500 place PokéChamp consistently within the top 30%-10% of human players.

📝 Abstract
We introduce PokéChamp, a minimax agent powered by Large Language Models (LLMs) for Pokémon battles. Built on a general framework for two-player competitive games, PokéChamp leverages the generalist capabilities of LLMs to enhance minimax tree search. Specifically, LLMs replace three key modules: (1) player action sampling, (2) opponent modeling, and (3) value function estimation, enabling the agent to effectively utilize gameplay history and human knowledge to reduce the search space and address partial observability. Notably, our framework requires no additional LLM training. We evaluate PokéChamp in the popular Gen 9 OU format. When powered by GPT-4o, it achieves a win rate of 76% against the best existing LLM-based bot and 84% against the strongest rule-based bot, demonstrating its superior performance. Even with an open-source 8-billion-parameter Llama 3.1 model, PokéChamp consistently outperforms the previous best LLM-based bot, PokéLLMon powered by GPT-4o, with a 64% win rate. PokéChamp attains a projected Elo of 1300-1500 on the Pokémon Showdown online ladder, placing it among the top 30%-10% of human players. In addition, this work compiles the largest real-player Pokémon battle dataset, featuring over 3 million games, including more than 500k high-Elo matches. Based on this dataset, we establish a series of battle benchmarks and puzzles to evaluate specific battling skills. We further provide key updates to the local game engine. We hope this work fosters further research that leverages Pokémon battles as a benchmark to integrate LLM technologies with game-theoretic algorithms addressing general multiagent problems. Videos, code, and dataset available at https://sites.google.com/view/pokechamp-llm.
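The abstract's three LLM-replaced modules slot into an otherwise standard depth-limited minimax loop. The sketch below illustrates that structure under loose assumptions: the three module functions stand in for the paper's prompt-based LLM calls (here they are deterministic toy stubs over a simplified HP-only battle state), and the `State`, `transition`, and action names are illustrative inventions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class State:
    our_hp: int   # simplified battle state: just two HP totals
    opp_hp: int
    turn: int

def sample_player_actions(state: State) -> List[str]:
    # Module (1): LLM action sampling would prune to a few promising moves;
    # stubbed here with a fixed candidate set.
    return ["attack", "heal"]

def model_opponent_actions(state: State) -> List[str]:
    # Module (2): LLM opponent modeling predicts likely replies from history;
    # stubbed as an always-attacking opponent.
    return ["attack"]

def estimate_value(state: State) -> float:
    # Module (3): LLM in-context value estimation at the search horizon;
    # stubbed as an HP differential.
    return float(state.our_hp - state.opp_hp)

def transition(state: State, ours: str, theirs: str) -> State:
    # Toy simultaneous-move dynamics: attacks deal 30, heal restores 20.
    our_hp, opp_hp = state.our_hp, state.opp_hp
    if ours == "heal":
        our_hp = min(100, our_hp + 20)
    if ours == "attack":
        opp_hp -= 30
    if theirs == "attack":
        our_hp -= 30
    return State(our_hp, opp_hp, state.turn + 1)

def minimax(state: State, depth: int) -> float:
    if depth == 0 or state.our_hp <= 0 or state.opp_hp <= 0:
        return estimate_value(state)
    # Max over our sampled actions, min over modeled opponent replies.
    return max(
        min(minimax(transition(state, a, b), depth - 1)
            for b in model_opponent_actions(state))
        for a in sample_player_actions(state)
    )

def best_action(state: State, depth: int = 2) -> str:
    return max(
        sample_player_actions(state),
        key=lambda a: min(minimax(transition(state, a, b), depth - 1)
                          for b in model_opponent_actions(state)),
    )
```

Because the LLM only appears behind these three function boundaries, the framework needs no model training: swapping the stubs for prompt-driven calls changes nothing about the search itself.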
Problem

Research questions and friction points this paper is trying to address.

Develops a minimax agent using LLMs for Pokémon battles.
Enhances minimax tree search by integrating LLM capabilities.
Creates benchmarks and datasets for evaluating battling skills.
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs enhance minimax tree search.
LLMs replace key game modules.
No additional LLM training required.
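The "no additional training" point rests on in-context prompting: the battle state is serialized into a prompt and the LLM's free-text judgment is parsed back into a number the search can use. The sketch below shows that pattern; the prompt wording, the state fields, and the parsing rules are illustrative assumptions, not the paper's actual prompts.

```python
import re
from typing import Dict, List

def build_value_prompt(state: Dict[str, object], history: List[str]) -> str:
    # Serialize a (simplified, hypothetical) battle state into a prompt
    # asking the LLM for a scalar position evaluation.
    lines = [
        "You are an expert Pokémon battler.",
        f"Your active Pokémon: {state['ours']} at {state['our_hp']}% HP.",
        f"Opponent's active Pokémon: {state['theirs']} at {state['opp_hp']}% HP.",
        "Recent turns: " + "; ".join(history[-3:]),
        "Rate the current position for you from 0 (lost) to 100 (won).",
        "Answer with a single integer.",
    ]
    return "\n".join(lines)

def parse_value(reply: str) -> float:
    # Extract the first integer in the LLM's reply; clamp to [0, 100].
    m = re.search(r"-?\d+", reply)
    if m is None:
        return 50.0  # neutral fallback when the reply is unparseable
    return max(0.0, min(100.0, float(m.group())))
```

The same serialize-prompt-parse loop, with different instructions, would back the action-sampling and opponent-modeling modules as well.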