pith. machine review for the scientific record.

arxiv: 2603.24621 · v2 · submitted 2026-03-24 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence

ARC Prize Foundation


Pith reviewed 2026-05-15 00:04 UTC · model grok-4.3

keywords: ARC-AGI, agentic intelligence, interactive benchmark, core knowledge priors, fluid intelligence, AI evaluation, goal inference, planning

The pith

ARC-AGI-3 introduces interactive environments where humans solve every task but current frontier AI systems score below 1 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ARC-AGI-3 as a benchmark of turn-based abstract environments that require agents to explore, infer goals, build models of dynamics, and plan actions without instructions or language. It is constructed to rely solely on core knowledge priors and calibrated through repeated human testing so that the tasks stay novel. Humans achieve complete success across the full set of environments. Frontier AI systems, as measured in March 2026, remain below 1 percent success. The scoring framework compares AI performance directly to human action baselines to quantify adaptive efficiency on unseen problems.

Core claim

ARC-AGI-3 consists of novel, language-free interactive environments that test an agent's ability to explore, infer implicit goals, construct internal models of environment dynamics, and execute effective action sequences. Human test-takers solve 100 percent of the environments after calibration, while frontier AI systems score below 1 percent. The benchmark evaluates fluid adaptive efficiency on tasks that use only core knowledge priors and avoids any reliance on external knowledge or language.

What carries the argument

The ARC-AGI-3 interactive benchmark: a collection of turn-based abstract environments that draw only on core-knowledge priors and whose difficulty is calibrated against human performance baselines.

If this is right

  • AI progress can be tracked by measuring how close agents come to human action efficiency on these novel tasks.
  • Success requires agents to perform goal inference and dynamic modeling without explicit training signals.
  • The efficiency-based scoring allows direct numerical comparison between AI agents and human baselines.
  • Passing the benchmark would demonstrate fluid intelligence on tasks that avoid language and memorized knowledge.
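
The efficiency comparison described in these bullets can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's published scoring formula: it assumes the human baseline for a level is the median human action count, that a level's score is the baseline divided by the agent's action count (capped at 1), and that unsolved levels score 0. The function names, the cap, and the averaging step are all hypothetical.

```python
from statistics import median
from typing import Optional, Sequence


def human_baseline(action_counts: Sequence[int]) -> float:
    """Median number of actions humans used to complete a level (assumed baseline)."""
    return float(median(action_counts))


def level_efficiency(agent_actions: Optional[int], baseline: float, cap: float = 1.0) -> float:
    """Agent efficiency on one level relative to the human baseline.

    agent_actions is None when the agent did not complete the level.
    The cap (an assumption) stops a single fast level from dominating the mean.
    """
    if agent_actions is None:
        return 0.0
    if agent_actions <= 0:
        raise ValueError("a completed level needs at least one action")
    return min(cap, baseline / agent_actions)


def environment_score(
    agent_runs: Sequence[Optional[int]],
    human_runs: Sequence[Sequence[int]],
) -> float:
    """Mean per-level efficiency across the levels of one environment."""
    if not agent_runs or len(agent_runs) != len(human_runs):
        raise ValueError("need one agent result and one human sample per level")
    scores = [
        level_efficiency(agent, human_baseline(humans))
        for agent, humans in zip(agent_runs, human_runs)
    ]
    return sum(scores) / len(scores)
```

Under these assumptions, an agent that needs twice the median human action count on a level scores 0.5 for that level, and an agent that fails the level scores 0.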

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improvements on ARC-AGI-3 could indicate AI systems that handle uncertainty and novelty more robustly than current methods.
  • The benchmark may serve as a template for creating further interactive tests that isolate adaptive reasoning from language use.
  • Future versions could add multi-step planning requirements or partial observability to increase the challenge.

Load-bearing premise

The environments remain genuinely novel and language-free to AI systems once they have been calibrated only through human test-takers.

What would settle it

A frontier AI system reaching 50 percent or higher success on the full ARC-AGI-3 suite would show that the benchmark no longer separates current AI capabilities from human performance.

Figures

Figures reproduced from arXiv: 2603.24621 by ARC Prize Foundation.

Figure 1: Frontier AI performance on ARC-AGI since introduction in 2019.

Figure 2: Screenshot of ARC-AGI-3 environment ls20.
  Adjacent text (2.3.1 The Observation Space): The agent views a 64x64 grid where each cell is one of 16 possible colors. A given grid state is called a “frame”. At each turn, the agent receives a frame or frame sequence. Frame sequences allow for non-interactive animations (e.g., an object moving across the screen) between player turns.
  Adjacent text (2.3.2 The Action Space): Each environment offers…

Figure 3: First level of ls20 in graph form. Notice the three repeating states, an artifact of the three-life mechanic of the level. Pwin for this level is exactly 1 in 355.
  Adjacent text (3.6 ARC-AGI-3 environment selection): The ARC-AGI-3 benchmark consists of the following datasets: Public demonstration set. The public set is designed to demonstrate the ARC-AGI-3 environment format, while being accessible and engaging for human …

Figure 4: Action progression and RHAE scoring for environment …

Figure 5: Participant demographics.
  Adjacent text (5.3 Human performance on ARC-AGI-3): In total, we recorded 486 unique participants across 414 candidate environments. This resulted in 2,893 total environment attempts.
  Plot residue: panel title "Per-Level Efficiency Distribution Relative to Human Baseline"; x-axis "Efficiency vs Median" (0% to 300%, with a marker at "Matches median (100%)"); y-axis "Count (level completions)".

Figure 6: Per-level efficiency distribution relative to the median human baseline across all public environ…

Figure 7: Time spent on environments by outcome, split between successful runs (“correct”) and unsuccessful …

Figure 8: Total actions by level for environment ls20.
  Adjacent text (6 ARC-AGI-3 pre-launch testing): Unlike ARC-AGI-1 and 2, we decided to release previews of ARC-AGI-3 prior to the full launch in order to guide our final benchmark design. This gave us critical feedback on what environments were easier and more engaging, and enabled early AI tests to vet our design choices. To incentivize this, we both hosted an agent preview comp…
Original abstract

We introduce ARC-AGI-3, an interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments in which agents must explore, infer goals, build internal models of environment dynamics, and plan effective action sequences without explicit instructions. Like its predecessors ARC-AGI-1 and 2, ARC-AGI-3 focuses entirely on evaluating fluid adaptive efficiency on novel tasks, while avoiding language and external knowledge. ARC-AGI-3 environments only leverage Core Knowledge priors and are difficulty-calibrated via extensive testing with human test-takers. Our testing shows humans can solve 100% of the environments, in contrast to frontier AI systems which, as of March 2026, score below 1%. In this paper, we present the benchmark design, its efficiency-based scoring framework grounded in human action baselines, and the methodology used to construct, validate, and calibrate the environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ARC-AGI-3, an interactive benchmark of novel abstract turn-based environments that test agentic intelligence via exploration, goal inference, internal model construction, and planning without language or external knowledge. Environments are restricted to Core Knowledge priors and calibrated through human testing; the central empirical claim is that humans solve 100% of tasks while frontier AI systems score below 1% as of March 2026. The manuscript describes the benchmark design, an efficiency-based scoring framework grounded in human action baselines, and the methodology for construction, validation, and calibration.

Significance. If the performance gap is shown to arise from matched protocols rather than evaluation artifacts, the benchmark would provide a valuable language-free test of fluid adaptive efficiency, extending the ARC-AGI series and offering a concrete challenge for agentic capabilities in frontier models.

major comments (2)
  1. [Methodology / Evaluation Protocol] Methodology section (evaluation protocol): the description of the AI testing setup omits the precise observation/action interface, episode length limits, number of trials per environment, and prompting regime supplied to frontier models. Because the <1% claim is load-bearing for the headline result, any deviation from the human calibration protocol (e.g., richer state representations or additional trials) could render the reported gap non-comparable.
  2. [Abstract and Results] Results and abstract: the 100% human / <1% AI figures are stated without reporting the number of environments, human sample size, identity of the specific frontier models tested, or any statistical controls for variance. This absence leaves the central empirical claim without sufficient verifiable support.
minor comments (2)
  1. [Scoring Framework] The efficiency-based scoring framework would benefit from an explicit equation or pseudocode showing how human action baselines are normalized into the final score.
  2. [Benchmark Design] Figure captions and environment descriptions could be expanded to clarify the exact turn-based interaction loop for readers unfamiliar with prior ARC-AGI versions.
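
For readers who want the turn-based loop spelled out, here is a minimal sketch. It takes only two facts from the paper's observation-space description: a frame is a 64x64 grid with 16 possible colors, and each turn returns a frame or a sequence of frames. Everything else is an assumption made for illustration, including the `ToyEnv` class, its `reset`/`step` methods, the `"left"`/`"right"` action names, and the win condition. None of it is the ARC-AGI Toolkit API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

GRID_SIZE = 64    # grid side length, from the paper's observation-space description
NUM_COLORS = 16   # number of possible cell colors, from the same description

Frame = List[List[int]]  # one grid state; each cell is a color index


def blank_frame(color: int = 0) -> Frame:
    """Return a frame filled with a single color."""
    return [[color] * GRID_SIZE for _ in range(GRID_SIZE)]


def validate_frame(frame: Frame) -> bool:
    """Check that a frame is 64x64 and uses only valid color indices."""
    return len(frame) == GRID_SIZE and all(
        len(row) == GRID_SIZE and all(0 <= c < NUM_COLORS for c in row)
        for row in frame
    )


@dataclass
class ToyEnv:
    """Hypothetical environment: move a marker along the top row to a goal cell."""
    goal_col: int = 5
    pos: int = 0

    def reset(self) -> List[Frame]:
        self.pos = 0
        return [self._render()]

    def step(self, action: str) -> Tuple[List[Frame], bool]:
        """Apply one action; return the resulting frame sequence and a win flag."""
        if action == "right":
            self.pos = min(GRID_SIZE - 1, self.pos + 1)
        elif action == "left":
            self.pos = max(0, self.pos - 1)
        else:
            raise ValueError(f"unknown action: {action}")
        return [self._render()], self.pos == self.goal_col

    def _render(self) -> Frame:
        frame = blank_frame()
        frame[0][self.goal_col] = 3  # goal cell
        frame[0][self.pos] = 1       # player marker
        return frame


def run_episode(
    env: ToyEnv,
    policy: Callable[[List[Frame]], str],
    max_actions: int = 1000,
) -> Optional[int]:
    """Run one episode; return the number of actions used, or None if unsolved."""
    frames = env.reset()
    for n_actions in range(1, max_actions + 1):
        frames, won = env.step(policy(frames))
        if won:
            return n_actions
    return None
```

The action count returned by `run_episode` is the quantity an efficiency-based score would compare against a human baseline. The policy receives only frames, with no instructions or goal description, which mirrors the language-free setup the paper describes.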

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of the evaluation protocol and empirical results.

Point-by-point responses
  1. Referee: Methodology section (evaluation protocol): the description of the AI testing setup omits the precise observation/action interface, episode length limits, number of trials per environment, and prompting regime supplied to frontier models. Because the <1% claim is load-bearing for the headline result, any deviation from the human calibration protocol (e.g., richer state representations or additional trials) could render the reported gap non-comparable.

    Authors: We agree that the current description of the AI evaluation protocol is insufficiently detailed for full reproducibility and comparability. The manuscript provides a high-level overview but does not enumerate the exact observation and action spaces, maximum episode lengths, trial counts per environment, or the precise prompting format used with frontier models. In the revised manuscript we will expand the Methodology section to specify these parameters explicitly and to document how they were aligned with the human calibration protocol (including identical state representations and trial limits). This will allow readers to verify that the reported performance gap reflects matched conditions rather than protocol differences. revision: yes

  2. Referee: Results and abstract: the 100% human / <1% AI figures are stated without reporting the number of environments, human sample size, identity of the specific frontier models tested, or any statistical controls for variance. This absence leaves the central empirical claim without sufficient verifiable support.

    Authors: We acknowledge that the abstract and results sections currently present the headline performance figures without the supporting quantitative details the referee requests. The manuscript states the aggregate outcomes but does not list the total number of environments, the size of the human test-taker cohort, the exact frontier models evaluated as of March 2026, or variance statistics. In the revision we will add these elements: the benchmark size, human sample size and recruitment criteria, the specific model versions tested, and appropriate statistical controls (e.g., per-environment success rates and confidence intervals). These additions will provide the verifiable support needed for the central claim while preserving the existing narrative. revision: yes
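
As one illustration of the confidence intervals mentioned in the second response, the sketch below computes a Wilson score interval for a per-environment success rate. The choice of the Wilson interval is an assumption made here; the manuscript and rebuttal do not say which interval the authors would use. The function name is hypothetical, and any counts passed to it in examples are illustrative, not results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to an approximate 95% interval. Unlike the simple
    normal approximation, the interval stays informative at 0% or 100%
    observed success, which matters when reported rates sit near the extremes.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With zero successes the lower bound is 0 and the upper bound shrinks as the number of trials grows, so the interval shows how much a sub-1% headline figure depends on the number of attempts behind it.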

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or performance claims

Full rationale

The paper presents ARC-AGI-3 as an empirical benchmark whose environments are constructed and calibrated through separate human testing protocols, with reported human (100%) and AI (<1%) scores arising from distinct evaluation runs rather than any derivation or equation that reduces one to the other by construction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text; references to prior ARC versions serve only as background and do not justify the new calibration or scoring results. The efficiency-based scoring framework is described as grounded in observed human action baselines without evidence that the headline performance gap is mathematically forced by the calibration inputs themselves. The derivation chain is therefore self-contained as an independent empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unverified assumption that the new environments are strictly limited to Core Knowledge priors and that human calibration ensures they remain novel for AI systems.

axioms (2)
  • [domain assumption] Environments only leverage Core Knowledge priors.
    Explicitly stated in the abstract as the basis for avoiding language and external knowledge.
  • [domain assumption] Human test-taker results provide a valid difficulty calibration and efficiency baseline.
    The scoring framework and 100% human solve rate depend on this calibration process described in the abstract.

pith-pipeline@v0.9.0 · 5442 in / 1288 out tokens · 51170 ms · 2026-05-15T00:04:04.422243+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

    cs.AI 2026-05 conditional novelty 7.0

    State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...

  2. Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.

  3. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

  4. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

  5. Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

    cs.AI 2026-05 unverdicted novelty 6.0

    Frontier LRMs match human game-learning behavior and predict fMRI signals an order of magnitude better than RL or Bayesian agents because of their in-context game-state representations.

  6. Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Agentick is a new unified benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human agents across 37 tasks, showing no single approach dominates.

  7. Counting as a minimal probe of language model reliability

    cs.CL 2026-05 unverdicted novelty 6.0

    Language models have limited stable counting capacity well below context limits and rely on a finite set of count-like internal states, collapsing to guessing once exhausted.

  8. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  9. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 5.0

    The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

  10. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 3.0

    This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 8 Pith papers · 3 internal anchors

  1. [1] ARC Prize 2024 Competition. https://arcprize.org/competitions/2024, 2024

  2. [2] ARC Prize 2025 Competition. https://arcprize.org/competitions/2025, 2025

  3. [3] ARC Prize Foundation. https://arcprize.org/, 2026. Founders: Mike Knoop, François Chollet. Operations: Bryan Landers, Greg Kamradt

  4. [4] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. On evaluation of embodied navigation agents, 2018

  5. [5] ARC Prize Foundation. ARC-AGI Community Leaderboard. https://github.com/arcprize/ARC-AGI-Community-Leaderboard, 2026

  6. [6] ARC Prize Foundation. ARC-AGI Toolkit. https://github.com/arcprize/ARC-AGI, 2026

  7. [7] ARC Prize Foundation. Gemini 3 Deep Think Preview Verification on ARC-AGI-2. https://huggingface.co/datasets/arcprize/arc_agi_v2_public_eval, 2026

  8. [8] François Chollet. On the Measure of Intelligence. https://arxiv.org/abs/1911.01547, 2019

  9. [9] François Chollet. OpenAI o3 Breakthrough High Score on ARC-AGI-Pub. https://arcprize.org/blog/oai-o3-pub-breakthrough, December 2024

  10. [10] François Chollet, Katherine Tong, Walter Reade, and Julia Elliott. Abstraction and Reasoning Challenge. https://kaggle.com/competitions/abstraction-and-reasoning-challenge, 2020. Kaggle

  11. [11] Alexis Fox, Junlin Wang, Paul Rosu, and Bhuwan Dhingra. Hill-climbing arc-agi-3, 2026

  12. [12] Steve Hsu. Post on LRM automation discovering novel results in quantum physics. https://x.com/hsu_steve/status/1996034522308026435, 2025

  13. [13] Greg Kamradt. ARC-AGI-3 Preview: 30-Day Learnings. https://arcprize.org/blog/arc-agi-3-preview-30-day-learnings, August 2025

  14. [14] Samuel Knutsen and Victoria Klein. Arcgentica: ARC-AGI-3 Agent Harness Built on the Agentica SDK. https://github.com/symbolica-ai/ARC-AGI-3-Agents, 2026

  15. [15] Lab42. ARCathon 2022. https://lab42.global/past-challenges/2022-arcathon/, 2022

  16. [16] Lab42. ARCathon 2023. https://lab42.global/past-challenges/2023-arcathon/, 2023

  17. [17] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the...

  18. [18] Dries Smit. ARC3 Solution. https://github.com/DriesSmit/ARC3-solution, 2025

  19. [19] I. Sorokin and Jean-Francois Puget. NVARC Solution to ARC-AGI-2 2025. https://drive.google.com/file/d/1vkEluaaJTzaZiJL69TkZovJUkPSDH5Xc/view, 2025

  20. [20] Elizabeth S. Spelke and Katherine D. Kinzler. Core knowledge. Developmental science, pages 89–96, 2007

  21. [21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. https://arxiv.org/abs/1706.03762, 2017

  22. [22] wd13ca. ARC-AGI-3 Agents. https://github.com/wd13ca/ARC-AGI-3-Agents, 2025

  23. [23] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. https://arxiv.org/abs/2201.11903, 2022