🤖 AI Summary
This work investigates whether large language models (LLMs) possess human-like progressive mental representation capabilities—i.e., the ability to dynamically construct and iteratively refine internal cognitive models—rather than relying solely on static, pre-trained pattern matching.
Method: We introduce the first “Progressive Mental Modeling Evaluation Paradigm,” which (1) incrementally injects problem information, (2) tracks latent representation evolution via probing and interpretability techniques, and (3) integrates stepwise prompting, cross-modal (text-only vs. multimodal) comparisons, and rigorous evaluation on the MathWorld mathematical reasoning benchmark.
Contribution/Results: Systematic experiments across model scales and task difficulties reveal that current LLMs—and multimodal LMs—fail to robustly build or sustainably update internal mental models; their reasoning remains fundamentally grounded in static pattern matching. This finding lends empirical support to the "reasoning-as-pattern-matching" hypothesis, challenges claims of human-like dynamic cognition in LMs, and establishes a novel, empirically grounded paradigm for studying LLM cognition.
📝 Abstract
Language Models (LMs) have demonstrated impressive capabilities in solving complex reasoning tasks, particularly when prompted to generate intermediate explanations. However, it remains an open question whether these intermediate reasoning traces reflect a dynamic, evolving thought process or merely sophisticated pattern recognition acquired during large-scale pre-training. Drawing inspiration from human cognition, where reasoning unfolds incrementally as new information is assimilated and internal models are continuously updated, we probe the mental models of various LMs. We propose a new way to assess the mental modeling of LMs: problem details are provided gradually, allowing each new piece of information to build upon and refine the model's internal representation of the task. We systematically compare this step-by-step mental modeling strategy with traditional full-prompt methods across both text-only and vision-and-text modalities. Experiments on the MathWorld dataset across different model sizes and problem complexities confirm that both text-based LLMs and multimodal LMs struggle to create mental representations, calling into question how their internal cognitive processes work.
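The incremental evaluation strategy described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `query_model` is a hypothetical stand-in for an actual LM API call, and the fact-splitting is simplified to pre-segmented sentences.

```python
# Sketch of the progressive (incremental) evaluation setup vs. the
# full-prompt baseline. `query_model` is a placeholder; a real
# evaluation would call an LM and score answers against gold labels.

def query_model(prompt: str) -> str:
    """Hypothetical LM call; here it just reports how much context it saw."""
    n_lines = prompt.count("\n") + 1
    return f"answer given {n_lines} line(s) of context"

def full_prompt_eval(facts: list[str], question: str) -> str:
    """Baseline: all problem information is presented at once."""
    prompt = "\n".join(facts + [question])
    return query_model(prompt)

def incremental_eval(facts: list[str], question: str) -> list[str]:
    """Progressive setup: facts are injected one at a time and the model
    is queried after each increment, so the trajectory of intermediate
    answers can reveal whether its task representation is being updated."""
    context: list[str] = []
    answers: list[str] = []
    for fact in facts:
        context.append(fact)
        answers.append(query_model("\n".join(context + [question])))
    return answers

facts = ["Ann has 3 apples.", "Ben gives Ann 2 more apples."]
question = "How many apples does Ann have now?"
baseline = full_prompt_eval(facts, question)
trajectory = incremental_eval(facts, question)
```

By construction, the final incremental step sees the same context as the full-prompt baseline, so differences in earlier steps of `trajectory` isolate how (or whether) the model's answer evolves as information accumulates.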