🤖 AI Summary
This work addresses the lack of interpretable internal reasoning mechanisms in current AI systems on visual abstract reasoning tasks such as the Abstraction and Reasoning Corpus (ARC), where models typically rely on behavioral statistical matching rather than human-like rule induction. To overcome this limitation, the paper introduces a novel paradigm, "reasoning as a modality," which architecturally decouples a global controller from a grid-based workspace. This is realized through role-separated Transformer modules and an iterative rule-execution mechanism, enabling controller-driven, interpretable reasoning. Evaluated under the VARC vision-centric protocol, the proposed model achieves 62.6% accuracy on the ARC-1 benchmark, surpassing average human performance (60.2%) and all prior methods, while demonstrating more coherent and structurally consistent rule application.
📝 Abstract
The Abstraction and Reasoning Corpus (ARC) provides a compact laboratory for studying abstract reasoning, an ability central to human intelligence. Modern AI systems, including LLMs and ViTs, largely operate as sequence-of-behavior prediction machines: they match observable behaviors by modeling token statistics without a persistent, readable mental state. This creates a gap with human behavior: humans can explain an action by decoding their internal state, whereas AI systems produce fluent post-hoc rationalizations that are not grounded in such a state. We hypothesize that reasoning is a modality: reasoning should exist as a distinct channel, separate from the low-level workspace on which rules are applied. To test this hypothesis, we cast ARC solving as a visual reasoning problem and designed a novel role-separated Transformer block that splits global controller tokens from grid workspace tokens, enabling iterative rule execution. Trained and evaluated within the VARC vision-centric protocol, our method achieves 62.6% accuracy on ARC-1, surpassing average human performance (60.2%) and significantly outperforming prior methods. Qualitatively, our models exhibit more coherent rule-application structure than the dense ViT baseline, consistent with a shift away from plausible "probability blobs" toward controller-driven reasoning.
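To make the "role-separated block" idea concrete, here is a minimal PyTorch sketch of what such a block might look like: controller and grid tokens attend jointly over one shared attention matrix, but each role has its own projection weights, and "iterative rule execution" amounts to applying the block repeatedly so the controller can refine the workspace step by step. All names, shapes, and design choices below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a role-separated Transformer block (NOT the paper's code).
# Controller tokens and grid workspace tokens share one attention computation,
# but each role gets its own QKV and output projections.
import torch
import torch.nn as nn


class RoleSeparatedBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.head_dim = heads, dim // heads
        # Separate projections per role: global controller vs. grid workspace.
        self.qkv_ctrl = nn.Linear(dim, 3 * dim)
        self.qkv_grid = nn.Linear(dim, 3 * dim)
        self.out_ctrl = nn.Linear(dim, dim)
        self.out_grid = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ctrl: torch.Tensor, grid: torch.Tensor):
        # ctrl: (B, Nc, D) controller tokens; grid: (B, Ng, D) workspace tokens.
        B, Nc, D = ctrl.shape
        # Role-specific QKV projections, then joint attention over all tokens.
        qkv = torch.cat([self.qkv_ctrl(self.norm(ctrl)),
                         self.qkv_grid(self.norm(grid))], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)

        def split(t):  # (B, N, D) -> (B, heads, N, head_dim)
            return t.view(B, -1, self.heads, self.head_dim).transpose(1, 2)

        q, k, v = map(split, (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, -1, D)
        # Role-specific output projections with residual connections.
        ctrl = ctrl + self.out_ctrl(out[:, :Nc])
        grid = grid + self.out_grid(out[:, Nc:])
        return ctrl, grid


def iterate(block: RoleSeparatedBlock, ctrl, grid, steps: int = 3):
    """Iterative rule execution: the controller repeatedly rewrites the workspace."""
    for _ in range(steps):
        ctrl, grid = block(ctrl, grid)
    return ctrl, grid
```

Under this reading, "reasoning as a modality" lives in the controller tokens: they carry the inferred rule across iterations, while the grid tokens hold only the low-level workspace the rule is applied to.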