Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether the capabilities of large language models stem primarily from memorization (crystallized intelligence) or from reasoning (fluid intelligence). Using chess as a controlled testbed, the authors construct a taxonomy of board positions varying in distributional distance from the training corpus, enabling the first quantitative disentanglement of fluid and crystallized intelligence in a structured environment. Through scalable engine-based evaluation, comparisons across multiple GPT generations, and reasoning-augmented inference, they find that model performance declines sharply as fluid-intelligence demands increase, deteriorating to near-random levels on out-of-distribution tasks. Although reasoning augmentation provides measurable gains, its effectiveness diminishes as positions move further from the training distribution, revealing fundamental limitations in the models' capacity for systematic generalization.
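
The paper's evaluation is engine-based, but no code accompanies this page. As a minimal sketch of what scoring a model's move against an engine could look like, assuming Stockfish and the python-chess package (the search depth and the helper name `centipawn_loss` are illustrative, not the authors' pipeline):

```python
import chess
import chess.engine

ENGINE_PATH = "stockfish"             # assumption: a Stockfish binary on PATH
LIMIT = chess.engine.Limit(depth=18)  # illustrative search depth

def centipawn_loss(fen: str, model_move_uci: str) -> int:
    """Score a model's move as centipawn loss relative to the engine's evaluation."""
    board = chess.Board(fen)
    mover = board.turn
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    try:
        # Evaluation before the move, from the mover's point of view.
        best = engine.analyse(board, LIMIT)["score"].pov(mover).score(mate_score=10_000)
        # Evaluation after the model's move, same point of view.
        board.push_uci(model_move_uci)
        after = engine.analyse(board, LIMIT)["score"].pov(mover).score(mate_score=10_000)
    finally:
        engine.quit()
    return max(0, best - after)  # clamp small negative engine noise to zero

# Example: a sound opening move should lose (close to) nothing.
print(centipawn_loss(chess.STARTING_FEN, "e2e4"))
```

Averaging such per-move losses over a set of positions gives a graded, fully automatic measure of playing strength, which is presumably what "engine-based scalable evaluation" refers to.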

📝 Abstract
Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall (crystallized intelligence) or reasoning ability (fluid intelligence). We introduce chess as a controlled testbed for disentangling these faculties. Leveraging the game's structure and scalable engine evaluations, we construct a taxonomy of positions varying in training-corpus proximity, ranging from common states solvable by memorization to novel ones requiring first-principles reasoning. We systematically evaluate multiple GPT generations under varying reasoning intensities. Our analysis reveals a clear gradient: performance consistently degrades as fluid-intelligence demands increase. Notably, on out-of-distribution tasks, performance collapses to random levels. While newer models improve, progress slows significantly for tasks outside the training distribution. Furthermore, while reasoning-augmented inference improves performance, its marginal benefit per token diminishes as distributional proximity decreases. These results suggest current architectures remain limited in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust fluid intelligence.
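
The abstract does not spell out how training-corpus proximity is measured. One simple proxy, offered only as an illustration and not as the authors' construction, is to count how often each position appears in a large reference corpus of games and bin positions by that frequency (again via python-chess; the PGN path, game cap, and thresholds below are assumptions):

```python
import chess.pgn
from collections import Counter

def position_counts(pgn_path: str, max_games: int = 10_000) -> Counter:
    """Count occurrences of each position (EPD key, move counters dropped) in a PGN corpus."""
    counts: Counter = Counter()
    with open(pgn_path, encoding="utf-8") as handle:
        for _ in range(max_games):
            game = chess.pgn.read_game(handle)
            if game is None:  # end of file
                break
            board = game.board()
            for move in game.mainline_moves():
                board.push(move)
                counts[board.epd()] += 1
    return counts

def proximity_bin(counts: Counter, epd: str) -> str:
    """Bin a position by corpus frequency: frequent positions are answerable by
    recall (crystallized), unseen ones demand first-principles play (fluid)."""
    n = counts[epd]
    if n >= 100:  # illustrative threshold
        return "in-distribution"
    if n > 0:
        return "rare"
    return "out-of-distribution"
```

Under a split of this kind, the paper's headline result is that performance degrades across the bins and collapses to near-random levels in the out-of-distribution one.
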
Problem

Research questions and friction points this paper is trying to address.

fluid intelligence
crystallized intelligence
large language models
systematic generalization
out-of-distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

fluid intelligence
crystallized intelligence
chess benchmark
out-of-distribution generalization
reasoning-augmented inference