ARC Prize 2025: Technical Report

📅 2026-01-15
🤖 AI Summary
This work addresses the limited few-shot generalization of AI systems on entirely novel tasks by surveying and validating a key paradigm, termed the "refinement loop", centered around the ARC-AGI-2 benchmark. The surveyed approaches integrate evolutionary program synthesis, compact neural networks (as small as 7M parameters), feedback-driven iterative optimization, and interactive reasoning architectures to substantially improve performance on unseen abstract reasoning problems. On the ARC-AGI-2 private test set, the top system achieves 24% accuracy, while the analysis highlights how current large models remain constrained by knowledge coverage and benchmark contamination. The report documents ARC-AGI's adoption as an industry standard benchmark and previews ARC-AGI-3, a next-generation benchmark designed to test exploration, planning, and memory.

📝 Abstract
The ARC-AGI benchmark series serves as a critical measure of few-shot generalization on novel tasks, a core aspect of intelligence. The ARC Prize 2025 global competition targeted the newly released ARC-AGI-2 dataset, which features greater task complexity compared to its predecessor. The Kaggle competition attracted 1,455 teams and 15,154 entries, with the top score reaching 24% on the ARC-AGI-2 private evaluation set. Paper submissions nearly doubled year-over-year to 90 entries, reflecting the growing research interest in fluid intelligence and abstract reasoning. The defining theme of 2025 is the emergence of the refinement loop -- a per-task iterative program optimization loop guided by a feedback signal. Refinement loops come in a variety of forms, in particular evolutionary program synthesis approaches and application-layer refinements to commercial AI systems. Such refinement loops are also possible in weight space, as evidenced by zero-pretraining deep learning methods which are now achieving competitive performance with remarkably small networks (7M parameters). In parallel, four frontier AI labs (Anthropic, Google DeepMind, OpenAI, and xAI) reported ARC-AGI performance in public model cards in 2025, establishing ARC-AGI as an industry standard benchmark for AI reasoning. However, our analysis indicates that current frontier AI reasoning performance remains fundamentally constrained by knowledge coverage, giving rise to new forms of benchmark contamination. In this paper, we survey the top-performing methods, examine the role of refinement loops in AGI progress, discuss knowledge-dependent overfitting, and preview ARC-AGI-3, which introduces interactive reasoning challenges that require exploration, planning, memory, goal acquisition, and alignment capabilities.
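The refinement-loop idea described above -- iteratively optimizing a per-task candidate program against a feedback signal computed from the few demonstration pairs -- can be sketched in miniature. This is an illustrative toy, not the report's actual method: real entries refine DSL or LLM-generated grid-transformation programs, whereas here the "program" is just a parameter pair (a, b) for f(x) = a*x + b, the feedback signal is training error, and refinement is deterministic hill climbing. All names are hypothetical.

```python
# Toy refinement loop: improve a candidate program using training-pair
# error as the feedback signal, stopping when feedback can no longer improve.

TRAIN_PAIRS = [(0, 3), (1, 5), (2, 7)]  # hidden rule: f(x) = 2*x + 3

def error(candidate):
    """Feedback signal: total absolute error on the demonstration pairs."""
    a, b = candidate
    return sum(abs(a * x + b - y) for x, y in TRAIN_PAIRS)

def neighbors(candidate):
    """All one-step edits to the candidate program's parameters."""
    a, b = candidate
    return [(a + da, b + db)
            for da in (-1, 0, 1) for db in (-1, 0, 1)
            if (da, db) != (0, 0)]

def refine(start=(0, 0), max_iters=50):
    """Refinement loop: keep the best candidate until feedback stops improving."""
    best = start
    for _ in range(max_iters):
        if error(best) == 0:
            break  # candidate reproduces every demonstration pair
        child = min(neighbors(best), key=error)
        if error(child) >= error(best):
            break  # local optimum; a real system would restart or mutate harder
        best = child
    return best

print(refine(), error(refine()))  # recovers (2, 3) with zero training error
```

The competition systems differ mainly in what fills each slot of this loop: evolutionary methods replace the greedy neighbor step with mutation and selection over program populations, while weight-space variants (such as the small zero-pretraining networks mentioned above) perform the same feedback-driven iteration via gradient updates rather than program edits.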
Problem

Research questions and friction points this paper is trying to address.

few-shot generalization
abstract reasoning
fluid intelligence
benchmark contamination
interactive reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

refinement loop
few-shot generalization
evolutionary program synthesis
zero-pretraining deep learning
interactive reasoning
F. Chollet
Mike Knoop
Gregory Kamradt
Bryan Landers
ARC Prize