Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks for world-model state tracking rely mostly on synthetic or linguistic data, which tests poorly whether a model captures state evolution in realistic structured environments. This work introduces a large-scale state-tracking benchmark built from 10 million real chess games. Models must reconstruct the exact final board state from a sequence of legal moves, and an out-of-distribution split of uniformly random legal games tests whether they have learned the state-transition rules rather than shortcuts from common human positions. Under a matched interface and training protocol, the recurrent architectures (block-diagonal SLiCE, Mamba-3, and Gated DeltaNet with negative eigenvalues) strongly outperform a causal Transformer at 3 and 8 million parameters. Real-game performance saturates above 18 million parameters, whereas the random-game split remains discriminative up to 40 million, and ablations show that less expressive state-transition mechanisms reduce out-of-distribution performance for all three recurrent models.

📝 Abstract

World models require state tracking, which is the ability to maintain a correct latent state across action sequences. Existing benchmarks are often synthetic or language-based, limiting their value as tests of structured state updates in realistic domains. We introduce Chess-World-Model, a large-scale state-tracking benchmark built from 10 million real chess games, where models predict the exact board state reached after a sequence of legal moves. Alongside a held-out real-game split, we include an out-of-distribution split from uniformly random legal play, which tests whether models learn the transition rules rather than shortcuts from common human positions. Prior theoretical and empirical work has shown that Transformers struggle to state-track, while input-dependent linear RNNs require expressive state-transition matrices to do so. We therefore benchmark a causal Transformer, block-diagonal SLiCE, Mamba-3, and Gated DeltaNet with negative eigenvalues under a matched interface and training protocol. The recurrent models strongly outperform the Transformer at 3 and 8 million parameters. Real-game performance saturates above 18 million parameters, but the random-uniform split remains discriminative up to 40 million, exposing failures otherwise hidden by scale. Additionally, ablations show that less expressive state-transition mechanisms reduce performance on the out-of-distribution split for all three recurrent models. Together, these results establish Chess-World-Model as a practical large-scale benchmark for state tracking that exposes failures model scale would otherwise conceal.
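To make the task in the abstract concrete, the following is a minimal sketch of how a target board state could be derived from a move sequence and how an exact-match score could be computed. It is an illustration under stated assumptions, not the benchmark's actual data format or metric: the `e2e4`-style move encoding, the 64-character board string, and the helper names (`apply_moves`, `to_string`, `exact_match`, `square_accuracy`) are all hypothetical. The replay is deliberately simplified and ignores castling, en passant, and promotion, so it is not a full rules engine.

```python
# Simplified illustration of state tracking: replay moves, emit the final board.
# ASSUMPTION: moves are 'e2e4'-style from/to pairs; castling, en passant and
# promotion are NOT handled, so this is not a complete chess implementation.

FILES = "abcdefgh"

START_ROWS = [
    "rnbqkbnr",  # rank 8 (black back rank)
    "pppppppp",  # rank 7
    "........",
    "........",
    "........",
    "........",
    "PPPPPPPP",  # rank 2
    "RNBQKBNR",  # rank 1 (white back rank)
]


def initial_board():
    """Return the starting position as a dict from square name to piece letter."""
    board = {}
    for row_idx, row in enumerate(START_ROWS):
        rank = 8 - row_idx
        for file_idx, piece in enumerate(row):
            if piece != ".":
                board[FILES[file_idx] + str(rank)] = piece
    return board


def apply_moves(moves):
    """Replay from/to moves from the starting position and return the final board."""
    board = initial_board()
    for move in moves:
        src, dst = move[:2], move[2:4]
        # Moving onto an occupied square overwrites it, which models a capture.
        board[dst] = board.pop(src)
    return board


def to_string(board):
    """Flatten a board into 64 characters, rank 8 first, '.' for empty squares."""
    return "".join(
        board.get(f + str(r), ".") for r in range(8, 0, -1) for f in FILES
    )


def exact_match(pred, target):
    """True only if every one of the 64 squares is predicted correctly."""
    return pred == target


def square_accuracy(pred, target):
    """Fraction of the 64 squares that are predicted correctly."""
    return sum(p == t for p, t in zip(pred, target)) / 64


if __name__ == "__main__":
    target = to_string(apply_moves(["e2e4", "e7e5", "g1f3", "b8c6"]))
    for i in range(0, 64, 8):
        print(target[i:i + 8])
```

Under these assumptions, a model that emits the untouched starting position for the four-move sequence above fails the exact-match check but still gets most squares right. This shows why an all-or-nothing board comparison is a stricter test of state tracking than a per-square score.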

Problem

Research questions and friction points this paper is trying to address.

state tracking

world models

chess benchmark

out-of-distribution generalization

exact state prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

state tracking

world models

chess benchmark