Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

📅 2026-05-28

🤖 AI Summary
Existing benchmarks for world-model state tracking rely mostly on synthetic or language-based data, which limits how well they test structured state updates in realistic domains. This work introduces Chess-World-Model, a large-scale state-tracking benchmark built from 10 million real chess games, in which models must predict the exact board state reached after a sequence of legal moves. An out-of-distribution split of uniformly random legal play tests whether models learn the transition rules rather than shortcuts from common human positions. Under a matched interface and training protocol, the recurrent architectures (block-diagonal SLiCE, Mamba-3, and Gated DeltaNet with negative eigenvalues) strongly outperform a causal Transformer at 3 and 8 million parameters. Real-game performance saturates above 18 million parameters, whereas the random-uniform split remains discriminative up to 40 million, which underscores the role of expressive state-transition mechanisms.
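To make the task concrete, the following is a minimal sketch of what "predict the exact board state after a move sequence" could look like. It is an illustration only, not the benchmark's code: the helper names (`initial_board`, `apply_moves`, `score`), the board encoding, and the two metrics (per-square accuracy and an exact-match flag) are assumptions, and the move handling is deliberately simplified to plain moves and captures.

```python
# Minimal sketch: track a board through UCI-style moves and score a prediction.
# Assumption: only plain moves and captures are handled; castling, en passant,
# and promotion are deliberately left out. All names here are hypothetical.

START_ROWS = [
    "rnbqkbnr",  # rank 8 (black pieces, lowercase)
    "pppppppp",  # rank 7
    "........",
    "........",
    "........",
    "........",
    "PPPPPPPP",  # rank 2
    "RNBQKBNR",  # rank 1 (white pieces, uppercase)
]

def initial_board():
    """Return a dict mapping squares like 'e2' to piece letters ('.' = empty)."""
    board = {}
    for row_index, row in enumerate(START_ROWS):
        rank = 8 - row_index
        for file_index, piece in enumerate(row):
            board["abcdefgh"[file_index] + str(rank)] = piece
    return board

def apply_moves(board, moves):
    """Apply moves such as 'e2e4' in order and return the final board."""
    board = dict(board)  # copy so the caller's board is not mutated
    for move in moves:
        src, dst = move[:2], move[2:4]
        board[dst] = board[src]  # a capture simply overwrites the target square
        board[src] = "."
    return board

def score(predicted, target):
    """Return (per-square accuracy, exact-match flag) over all 64 squares."""
    correct = sum(predicted[sq] == target[sq] for sq in target)
    return correct / len(target), correct == len(target)

target = apply_moves(initial_board(), ["e2e4", "e7e5", "g1f3", "b8c6"])
prediction = dict(target)
prediction["f3"] = "."  # hypothetical model error: the knight on f3 is missed
accuracy, exact = score(prediction, target)
print(accuracy, exact)  # 0.984375 False
```

A single wrong square leaves per-square accuracy at 63/64 while the exact-match flag is already false, which is why an exact-state criterion is much stricter than an averaged one.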
📝 Abstract
World models require state tracking, which is the ability to maintain a correct latent state across action sequences. Existing benchmarks are often synthetic or language-based, limiting their value as tests of structured state updates in realistic domains. We introduce Chess-World-Model, a large-scale state-tracking benchmark built from 10 million real chess games, where models predict the exact board state reached after a sequence of legal moves. Alongside a held-out real-game split, we include an out-of-distribution split from uniformly random legal play, which tests whether models learn the transition rules rather than shortcuts from common human positions. Prior theoretical and empirical work has shown that Transformers struggle to state-track, while input-dependent linear RNNs require expressive state-transition matrices to do so. We therefore benchmark a causal Transformer, block-diagonal SLiCE, Mamba-3, and Gated DeltaNet with negative eigenvalues under a matched interface and training protocol. The recurrent models strongly outperform the Transformer at 3 and 8 million parameters. Real-game performance saturates above 18 million parameters, but the random-uniform split remains discriminative up to 40 million, exposing failures otherwise hidden by scale. Additionally, ablations show that less expressive state-transition mechanisms reduce performance on the out-of-distribution split for all three recurrent models. Together, these results establish Chess-World-Model as a practical large-scale benchmark for state tracking that exposes failures model scale would otherwise conceal.
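The point about expressive state-transition matrices can be illustrated with a toy case that is much simpler than chess: tracking the parity of a bit stream with a scalar linear recurrence h_t = a(x_t) · h_{t-1}. The sketch below is an assumption-laden illustration and does not reproduce any of the benchmarked architectures; the function names and the decay value are hypothetical.

```python
# Toy sketch: parity tracking with a scalar input-dependent linear recurrence.
# Assumption: this illustrates why a negative eigenvalue helps state tracking;
# it is not the update rule of SLiCE, Mamba-3, or Gated DeltaNet.

def run_scalar_rnn(bits, transition):
    """Run h_t = a(x_t) * h_{t-1} from h_0 = 1 and return the final state."""
    h = 1.0
    for bit in bits:
        h = transition(bit) * h
    return h

def signed_transition(bit):
    # Eigenvalue -1 on a 1 flips the state; eigenvalue +1 on a 0 keeps it.
    return -1.0 if bit else 1.0

def nonnegative_transition(bit, decay=0.5):
    # Restricted to [0, 1]: the state can shrink but can never change sign.
    return decay if bit else 1.0

bits = [1, 0, 1, 1, 0, 1, 1]  # five ones, so the parity is odd

signed = run_scalar_rnn(bits, signed_transition)
restricted = run_scalar_rnn(bits, nonnegative_transition)

parity_from_signed = 0 if signed > 0 else 1
print(signed, parity_from_signed)  # -1.0 1
print(restricted)                  # 0.03125
```

With the signed transition, the sign of the state is exactly the parity. With the non-negative transition, the state stays positive for every input, so a sign readout cannot recover parity, and the magnitude decays toward zero as the sequence grows.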
Problem

Research questions and friction points this paper is trying to address.

state tracking
world models
chess benchmark
out-of-distribution generalization
exact state prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

state tracking
world models
chess benchmark
out-of-distribution generalization
recurrent architectures