pith. machine review for the scientific record.

arxiv: 2505.11831 · v2 · submitted 2025-05-17 · 💻 cs.AI

Recognition: 3 theorem links


ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords ARC-AGI-2 · benchmark · abstract reasoning · fluid intelligence · AI evaluation · problem solving · cognitive assessment

The pith

ARC-AGI-2 introduces an expanded set of tasks to evaluate higher levels of abstract reasoning in AI systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ARC-AGI-2, an upgraded benchmark that builds on the 2019 ARC-AGI with newly curated tasks for finer-grained assessment of fluid intelligence. The tasks keep the input-output pair format but target more complex problem-solving while still requiring minimal prior knowledge. Human testing shows that people can reliably solve the tasks, establishing a clear baseline, while current AI systems perform poorly, underscoring the benchmark's difficulty for frontier models. This provides a tool to measure progress toward more general AI capabilities.

Core claim

ARC-AGI-2 preserves the core input-output pair task format but incorporates a newly curated and expanded set of tasks designed to assess abstract reasoning and problem-solving at higher levels of fluid intelligence. Extensive human testing provides a robust baseline: the tasks remain accessible to humans yet difficult for current AI.

What carries the argument

The ARC-AGI-2 task collection, consisting of novel grid-based puzzles that test core reasoning abilities without reliance on specific prior knowledge.
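For readers new to the format the paper preserves, the sketch below shows the input-output pair structure as it appears in the open-source ARC-AGI-1 release ("train" and "test" lists of "input"/"output" grids of small integers). The paper states only that the task format is unchanged, so carrying these field names over to ARC-AGI-2 is an assumption.

    import json
    from dataclasses import dataclass

    # A grid is a rectangular array of small integers (rendered as colors in the ARC UI).
    Grid = list[list[int]]

    @dataclass
    class ArcTask:
        """One ARC-style task: a few demonstration pairs plus held-out test pairs."""
        train: list[dict[str, Grid]]  # [{"input": Grid, "output": Grid}, ...]
        test: list[dict[str, Grid]]

    def load_task(path: str) -> ArcTask:
        # Assumes the ARC-AGI-1 JSON schema; verify against the released ARC-AGI-2 files.
        with open(path) as f:
            raw = json.load(f)
        return ArcTask(train=raw["train"], test=raw["test"])

A solver reads the train pairs, infers the transformation, and must produce the output grid for each test input; no task-specific prior knowledge is supposed to be required.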

If this is right

  • Researchers can use ARC-AGI-2 to obtain more granular signals on AI reasoning progress.
  • The benchmark enables continuity in evaluation while increasing the cognitive demand.
  • Human performance baselines allow direct comparison with AI results.
  • It highlights the gap between human fluid intelligence and current AI capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If AI solves ARC-AGI-2, it may indicate advances in general problem-solving transferable to new domains.
  • The design could influence future benchmarks to focus more on minimal-knowledge tasks.
  • This might encourage development of AI that relies less on memorized patterns and more on on-the-fly abstraction.

Load-bearing premise

The selected tasks genuinely require higher levels of fluid intelligence with only minimal prior knowledge, and the human testing protocol yields a reliable baseline.
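What "reliable" means is made quantitative only in the full paper; one illustrative reading is that per-task human solve rates come with usably tight confidence intervals. A minimal sketch, assuming each task is attempted by a fixed panel of participants; the counts are invented, not taken from the paper.

    import math

    def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
        # 95% Wilson score interval for a binomial proportion (per-task solve rate).
        if trials == 0:
            return (0.0, 1.0)
        p = successes / trials
        denom = 1 + z**2 / trials
        center = (p + z**2 / (2 * trials)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
        return (max(0.0, center - half), min(1.0, center + half))

    # Invented illustration: 18 of 20 participants solve a given task.
    lo, hi = wilson_interval(18, 20)
    print(f"solve rate 0.90, 95% CI [{lo:.2f}, {hi:.2f}]")  # roughly [0.70, 0.97]

With panels of this size the interval is wide, which is exactly why the referee below asks for sample sizes and statistics.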

What would settle it

A demonstration that current AI systems achieve human-level performance on the new ARC-AGI-2 tasks would falsify the claim that the benchmark is substantially harder for AI while remaining accessible to humans.
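In code, that test reduces to exact-grid-match scoring. A minimal harness sketch, assuming the task schema sketched earlier and a small fixed attempt budget per test input; a two-attempt budget would mirror public ARC Prize scoring but is an assumption here, not something this paper specifies.

    Grid = list[list[int]]

    def solved(attempts: list[Grid], target: Grid) -> bool:
        # A test pair counts as solved if any allowed attempt matches the target exactly.
        return any(a == target for a in attempts)

    def task_score(attempts_per_pair: list[list[Grid]], targets: list[Grid]) -> float:
        # Fraction of a task's test pairs solved (most tasks have a single test pair).
        hits = sum(solved(a, t) for a, t in zip(attempts_per_pair, targets))
        return hits / len(targets)

    def benchmark_accuracy(task_scores: list[float]) -> float:
        # Mean over the evaluation set; directly comparable to a human pass rate.
        return sum(task_scores) / len(task_scores)

If a frontier system's benchmark accuracy reached the human baseline under the same attempt budget, the difficulty half of the claim would be falsified.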

read the original abstract

The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), introduced in 2019, established a challenging benchmark for evaluating the general fluid intelligence of artificial systems via a set of unique, novel tasks only requiring minimal prior knowledge. While ARC-AGI has spurred significant research activity over the past five years, recent AI progress calls for benchmarks capable of finer-grained evaluation at higher levels of cognitive complexity. We introduce ARC-AGI-2, an upgraded version of the benchmark. ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to provide a more granular signal to assess abstract reasoning and problem-solving abilities at higher levels of fluid intelligence. To contextualize the difficulty and characteristics of ARC-AGI-2, we present extensive results from human testing, providing a robust baseline that highlights the benchmark's accessibility to human intelligence, yet difficulty for current AI systems. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ARC-AGI-2 as an upgraded benchmark extending the 2019 ARC-AGI. It preserves the input-output pair task format while adding a newly curated and expanded task set intended to deliver finer-grained measurement of abstract reasoning and problem-solving at higher levels of fluid intelligence. The paper presents extensive human testing results to establish a baseline showing the tasks remain accessible to humans yet difficult for current AI systems, positioning ARC-AGI-2 as a next-generation tool for tracking progress toward more general AI capabilities.

Significance. If the new tasks genuinely isolate higher fluid intelligence with minimal prior knowledge and the human baselines prove reliable, ARC-AGI-2 would supply a valuable, more granular instrument for evaluating frontier AI reasoning. The continuity with the original format and the provision of human data would help researchers quantify incremental gains beyond current AI performance levels.

major comments (2)
  1. [Abstract] The claim that the new tasks and human results supply a 'robust baseline' is unsupported: the manuscript provides no methodology details, sample sizes, quantitative performance statistics, or statistical analysis of the human data.
  2. [Task Curation] The central assertion that the newly selected tasks require higher levels of fluid intelligence with only minimal prior knowledge lacks supporting evidence such as task categorization by cognitive demand, pilot testing results, or comparison metrics against ARC-AGI-1 tasks.
minor comments (1)
  1. [Human Testing] The manuscript should include a dedicated section or appendix with the full human testing protocol, participant demographics, and raw or summarized performance numbers to allow independent verification of the baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript on ARC-AGI-2. We address each major comment point by point below, indicating where revisions will be made to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] The claim that the new tasks and human results supply a 'robust baseline' is unsupported: the manuscript provides no methodology details, sample sizes, quantitative performance statistics, or statistical analysis of the human data.

    Authors: We agree that the abstract as currently written does not include these supporting details, which weakens the 'robust baseline' phrasing. The full manuscript contains a Human Evaluation section describing the testing protocol, but we acknowledge it may not have been sufficiently summarized or statistically detailed for the abstract's claim. In the revised version we will (1) expand the abstract to briefly report sample size, aggregate human performance metrics, and a note on the analysis performed, and (2) add a concise summary table of human results to the main text if not already present. This directly addresses the concern. revision: yes

  2. Referee: [Task Curation] The central assertion that the newly selected tasks require higher levels of fluid intelligence with only minimal prior knowledge lacks supporting evidence such as task categorization by cognitive demand, pilot testing results, or comparison metrics against ARC-AGI-1 tasks.

    Authors: We accept that the current manuscript does not provide explicit supporting evidence for the claim of higher fluid intelligence demand. While task selection followed the same minimal-prior-knowledge principle as ARC-AGI-1, we did not include the requested categorization, pilot data, or direct comparison metrics. In the revision we will add a dedicated subsection on task curation that reports (a) expert categorization by cognitive demand, (b) pilot testing outcomes, and (c) quantitative comparisons (e.g., solution times and error patterns) with ARC-AGI-1 tasks. This will supply the evidence the referee correctly notes is missing. revision: yes
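To make the promised comparison concrete, one plausible shape for the ARC-AGI-1 versus ARC-AGI-2 analysis is sketched below; the metric choices and all inputs are hypothetical, standing in for the solution-time and error-pattern data the revision commits to report.

    from statistics import mean, median

    def compare_task_sets(times_v1: list[float], times_v2: list[float]) -> dict[str, float]:
        # Hypothetical comparison of per-task human solution times (seconds)
        # between ARC-AGI-1 and ARC-AGI-2; a real revision would pair this with
        # error patterns and expert categorization by cognitive demand.
        return {
            "v1_median_s": median(times_v1),
            "v2_median_s": median(times_v2),
            "mean_time_ratio_v2_over_v1": mean(times_v2) / mean(times_v1),
        }

A ratio well above 1, alongside shifted error patterns, would be the kind of evidence the referee asks for.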

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript is a benchmark announcement paper that introduces ARC-AGI-2 as a new task set with accompanying human performance data. It contains no equations, no fitted parameters, no predictive derivations, and no load-bearing logical steps that reduce to self-definitions, self-citations, or ansatzes. All central claims are descriptive statements about task curation and empirical human testing results, which are presented directly rather than derived from prior internal assumptions. Historical references to the 2019 ARC-AGI are contextual background only and do not serve as justification for any derivation within this paper. The work is therefore self-contained with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contains no mathematical derivations, fitted constants, or postulated entities; it rests only on the domain assumption that the curated tasks measure fluid intelligence.

axioms (1)
  • domain assumption: The selected tasks require only minimal prior knowledge yet probe higher levels of fluid intelligence.
    Invoked in the abstract when describing task design and human accessibility.

pith-pipeline@v0.9.0 · 5501 in / 1096 out tokens · 36045 ms · 2026-05-15T16:45:48.667064+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

    cs.LG 2026-05 unverdicted novelty 8.0

    MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

  2. Harnessing Agentic Evolution

    cs.AI 2026-05 unverdicted novelty 7.0

    AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

  3. Don't Pause! Every prediction matters in a streaming video

    cs.CV 2026-04 unverdicted novelty 7.0

    SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.

  4. SASAV: Self-Directed Agent for Scientific Analysis and Visualization

    cs.GR 2026-04 unverdicted novelty 7.0

    SASAV introduces the first fully autonomous multi-agent system for scientific data analysis and visualization that operates without external prompting or human-in-the-loop feedback.

  5. Less is More: Recursive Reasoning with Tiny Networks

    cs.LG 2025-10 unverdicted novelty 7.0

    TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.

  6. Counting as a minimal probe of language model reliability

    cs.CL 2026-05 unverdicted novelty 6.0

    Language models have limited stable counting capacity well below context limits and rely on a finite set of count-like internal states, collapsing to guessing once exhausted.

  7. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  8. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  9. Agentic Frameworks for Reasoning Tasks: An Empirical Study

    cs.AI 2026-04 unverdicted novelty 6.0

    An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

  10. C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions

    cs.LG 2026-04 unverdicted novelty 6.0

    C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outp...

  11. VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

    cs.CV 2026-04 unverdicted novelty 6.0

    VLMs bypass visual comparison by recovering semantic labels for nameable entities and hallucinate on unnamable ones, as shown by performance gaps and Logit Lens analysis.

  12. MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

    cs.LG 2026-02 unverdicted novelty 6.0

    MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.

  13. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

    cs.AI 2025-06 unverdicted novelty 6.0

    LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.

  14. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    cs.CL 2025-06 conditional novelty 6.0

    High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

  15. Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid

    cs.AI 2026-05 unverdicted novelty 5.0

    A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.

  16. Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency

    cs.LG 2026-04 unverdicted novelty 5.0

    KoPE adds Kuramoto-based oscillatory phase states and synchronization to Vision Transformers, improving training, parameter, and data efficiency on structured vision tasks.

  17. Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution

    cs.AI 2026-04 unverdicted novelty 5.0

    Squeeze Evolve is a multi-model orchestration framework that improves efficiency and performance in verifier-free evolutionary inference, cutting costs up to 3x while matching verifier-based methods on several benchmarks.

  18. Hierarchical Reasoning Model

    cs.AI 2025-06 unverdicted novelty 5.0

    HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples ...

  19. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  20. Measuring AI Reasoning: A Guide for Researchers

    cs.AI 2026-05 unverdicted novelty 4.0

    Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 20 Pith papers

  1. [1] ARC Prize - Leaderboard. https://arcprize.org/leaderboard

  2. [2] ARC Prize - Policy. https://arcprize.org/policy

  3. [3] Abstraction and Reasoning Challenge. https://www.kaggle.com/competitions/abstraction-and-reasoning-challenge, 2020. Kaggle competition.

  4. [4] ARCathon 2022. https://lab42.global/past-challenges/2022-arcathon/, 2022. Lab42 competition.

  5. [5] ARCathon 2023. https://lab42.global/past-challenges/2023-arcathon/, 2023. Lab42 competition.

  6. [6] ARC Prize - Model Baseline. https://github.com/arcprize/model_baseline, 2024. Open-source code for testing model baseline performance on ARC-AGI.

  7. [7] ARC Prize 2024. https://www.kaggle.com/competitions/arc-prize-2024, 2024. Kaggle competition.

  8. [9] ARC Prize Foundation, 2024. A nonprofit organization dedicated to fostering open-source scientific progress through enduring AI benchmarks.

  9. [10] François Chollet. On the Measure of Intelligence. https://arxiv.org/abs/1911.01547, 2019.

  10. [11] François Chollet. Analyzing o3 and o4-mini with ARC-AGI. https://arcprize.org/blog/analyzing-o3-with-arc-agi, 2025. ARC Prize Blog.

  11. [12] Greg Kamradt. OpenAI o3 Breakthrough High Score on ARC-AGI-Pub. https://arcprize.org/blog/oai-o3-pub-breakthrough, 2024. ARC Prize Blog.

  12. [13] Aysja Johnson, Wai Keen Vong, Brenden M. Lake, and Todd M. Gureckis. Fast and flexible: Human program induction in abstract reasoning tasks. CoRR, abs/2103.05823, 2021.

  13. [14] Solim LeGris, Wai Keen Vong, Brenden M. Lake, and Todd M. Gureckis. H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark. https://arxiv.org/abs/2409.01374, 2024.