🤖 AI Summary
The existing ARC-AGI benchmark inadequately captures higher-order fluid intelligence and lacks fine-grained evaluation of abstract reasoning and out-of-distribution generalization.
Method: We introduce ARC-AGI-2—the first benchmark explicitly designed to assess fluid intelligence under high cognitive complexity—while preserving the input-output task paradigm. Leveraging principles from cognitive science and program synthesis, we systematically redesign tasks to be human-solvable yet difficult for current AI systems, including state-of-the-art large language models, to generalize to.
Contribution/Results: Extensive human benchmarking confirms high solvability (mean accuracy >90%), whereas state-of-the-art models achieve <20%, markedly widening the human–AI performance gap. ARC-AGI-2 thus provides a more rigorous, interpretable, and cognitively grounded benchmark for delineating the capabilities and limitations of artificial general intelligence, enabling both precise capability mapping and advanced training paradigms.
📝 Abstract
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), introduced in 2019, established a challenging benchmark for evaluating the general fluid intelligence of artificial systems via a set of unique, novel tasks requiring only minimal prior knowledge. While ARC-AGI has spurred significant research activity over the past five years, recent AI progress calls for benchmarks capable of finer-grained evaluation at higher levels of cognitive complexity. We introduce ARC-AGI-2, an upgraded version of the benchmark. ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to provide a more granular signal for assessing abstract reasoning and problem-solving abilities at higher levels of fluid intelligence. To contextualize the difficulty and characteristics of ARC-AGI-2, we present extensive results from human testing, providing a robust baseline that shows the benchmark is accessible to humans yet remains difficult for current AI systems. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.
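To make the input-output pair task format concrete, here is a minimal sketch of how such a task can be represented and checked. It assumes the JSON layout used by the original ARC-AGI repository (grids as 2-D lists of integers 0–9, each denoting a color, split into `train` demonstration pairs and `test` pairs); the toy task and the `solve` rule below are hypothetical illustrations, not tasks from the benchmark.

```python
import json

# A toy task in the input-output pair format (layout as in the
# original ARC-AGI repository; the task itself is invented here).
task_json = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 3], [0, 3]], "output": [[3, 3], [3, 0]]}
  ]
}
"""
task = json.loads(task_json)

def solve(grid):
    # Hypothetical candidate rule for this toy task: mirror each row.
    # Real ARC-AGI-2 tasks require inferring a novel rule per task
    # from only a handful of demonstration pairs.
    return [list(reversed(row)) for row in grid]

# Verify the candidate rule against the training demonstrations,
# then apply it to the test input.
assert all(solve(p["input"]) == p["output"] for p in task["train"])
prediction = solve(task["test"][0]["input"])
```

The key property this format enforces is that each task is self-contained: a solver sees only the demonstration pairs and must produce the test output, with no task-specific training data available in advance.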