🤖 AI Summary
The existing ARC-AGI benchmark inadequately captures higher-order fluid intelligence and lacks fine-grained evaluation of abstract reasoning and out-of-distribution generalization.
Method: We introduce ARC-AGI-2—the first benchmark explicitly designed to assess fluid intelligence under high cognitive complexity—while preserving the input-output task paradigm. Leveraging principles from cognitive science and program synthesis, we systematically redesign tasks to be human-solvable yet difficult for current AI systems, including state-of-the-art large language models, to generalize to.
Contribution/Results: Extensive human benchmarking confirms high solvability (mean accuracy >90%), whereas state-of-the-art models achieve <20%, markedly widening the human–AI performance gap. ARC-AGI-2 thus provides a more rigorous, interpretable, and cognitively grounded benchmark for delineating the capabilities and limitations of artificial general intelligence, enabling both precise capability mapping and advanced training paradigms.
📝 Abstract
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), introduced in 2019, established a challenging benchmark for evaluating the general fluid intelligence of artificial systems via a set of unique, novel tasks requiring only minimal prior knowledge. While ARC-AGI has spurred significant research activity over the past five years, recent AI progress calls for benchmarks capable of finer-grained evaluation at higher levels of cognitive complexity. We introduce ARC-AGI-2, an upgraded version of the benchmark. ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to provide a more granular signal for assessing abstract reasoning and problem-solving abilities at higher levels of fluid intelligence. To contextualize the difficulty and characteristics of ARC-AGI-2, we present extensive results from human testing, providing a robust baseline that shows the benchmark is accessible to humans yet remains difficult for current AI systems. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.
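To make the input-output pair task format concrete, here is a minimal sketch of how such a task can be represented and checked. It assumes the JSON layout used by the original ARC-AGI repository (grids as 2-D lists of integers 0–9, each denoting a color, split into `train` demonstration pairs and `test` pairs); the toy task and the `solve` rule below are hypothetical illustrations, not tasks from the benchmark.

```python
import json

# A toy task in the input-output pair format (layout as in the
# original ARC-AGI repository; the task itself is invented here).
task_json = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 3], [0, 3]], "output": [[3, 3], [3, 0]]}
  ]
}
"""
task = json.loads(task_json)

def solve(grid):
    # Hypothetical candidate rule for this toy task: mirror each row.
    # Real ARC-AGI-2 tasks require inferring a novel rule per task
    # from only a handful of demonstration pairs.
    return [list(reversed(row)) for row in grid]

# Verify the candidate rule against the training demonstrations,
# then apply it to the test input.
assert all(solve(p["input"]) == p["output"] for p in task["train"])
prediction = solve(task["test"][0]["input"])
```

The key property this format enforces is that each task is self-contained: a solver sees only the demonstration pairs and must produce the test output, with no task-specific training data available in advance.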