Explore Before You Solve: The Speed--Depth Trade-off in Epistemic Agents for ARC-AGI-3

Liew Keong Han

arxiv: 2605.25931 · v1 · pith:5K7LWXVUnew · submitted 2026-05-25 · 💻 cs.AI

Pith reviewed 2026-06-29 21:57 UTC · model grok-4.3

keywords: ARC-AGI-3 · benchmark validity · epistemic agents · exploration strategies · speed-depth trade-off · AERA agent · interactive reasoning · non-intelligent heuristics

The pith

The public ARC-AGI-3 evaluation set can be solved entirely by non-intelligent strategies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that all 25 public games in ARC-AGI-3 yield to basic tactics: 10 via a single blind step, 5 after one probing action, 1 via repeated ACTION1 presses, 1 via diverse exploration, and 8 via a single repeated action given a budget of 50-200 steps. Separately, a null-coordinate flaw at the library level bypasses 18 games in one move. The public set therefore cannot separate intelligent exploration from trivial heuristics. The author introduces AERA, an agent with EXPLORE, VERIFY and PLAN phases, which scores RHAE=0.2116 (4/25 solved) on the public games while random and no-explore baselines score 0.0000. Performance is framed through a speed-depth trade-off in which, under a convexity assumption, deviation from the efficiency frontier incurs a quadratic penalty.

Core claim

Every one of the 25 public ARC-AGI-3 games is reachable through non-intelligent strategies: 10 in a single blind step, 5 after one probing action, 1 via repeated ACTION1 presses, 1 via diverse exploration, and 8 via single repeated actions with sufficient budget (50-200 steps). A library-level null-coordinate vulnerability additionally bypasses 18 games in 1 step. This benchmark critique implies the public evaluation set cannot discriminate intelligent exploration from trivial heuristics; the private 55-game evaluation is the only genuine intelligence test. AERA achieves RHAE=0.2116 (4/25 solved) on the public set while random and no-explore baselines score 0.0000.
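
As a quick arithmetic check on the core claim, the five strategy categories should partition the 25 public games, with the null-coordinate bypass counted separately. The sketch below only restates the numbers quoted above; the dictionary keys and variable names are illustrative and do not come from the paper's code.

```python
# Game counts per strategy category, as quoted in the core claim.
strategy_counts = {
    "single blind step": 10,
    "one probing action": 5,
    "repeated ACTION1 presses": 1,
    "diverse exploration": 1,
    "single repeated action (50-200 steps)": 8,
}
TOTAL_PUBLIC_GAMES = 25
NULL_COORDINATE_BYPASS = 18  # reported separately, overlaps the categories above
AERA_SOLVED = 4

total = sum(strategy_counts.values())
covers_all_games = total == TOTAL_PUBLIC_GAMES
bypass_share = NULL_COORDINATE_BYPASS / TOTAL_PUBLIC_GAMES
aera_share = AERA_SOLVED / TOTAL_PUBLIC_GAMES

print(total, covers_all_games)                   # 25 True
print(f"{bypass_share:.0%}, {aera_share:.0%}")   # 72%, 16%
```

The categories sum to 25, so the enumeration is at least internally consistent. Whether each game is assigned to the right category is a separate question that the counts alone cannot answer.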

What carries the argument

The Speed--Depth trade-off framework, under which RHAE's quadratic form emerges as a second-order penalty for deviating from the Pareto frontier between action efficiency and information gain, together with the three-phase EXPLORE / VERIFY / PLAN structure of the AERA agent.
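
For intuition on why a quadratic term is the natural leading penalty, the following is a generic second-order expansion, not the paper's derivation. The notation is assumed for illustration: $x$ is an agent's operating point in (action efficiency, information gain) space, $x^*$ is a frontier point treated as an unconstrained minimiser of a twice-differentiable loss $L$, and $H$ is the Hessian of $L$ at $x^*$.

```latex
% Assumed notation, not taken from the paper:
%   x   : operating point in (action-efficiency, information-gain) space
%   x^* : frontier point, assumed to be an unconstrained minimiser of L
%   L   : twice-differentiable loss
\begin{align}
L(x) &= L(x^*) + \nabla L(x^*)^{\top}(x - x^*)
        + \tfrac{1}{2}(x - x^*)^{\top} H\,(x - x^*)
        + o\!\left(\lVert x - x^* \rVert^{2}\right), \\
\nabla L(x^*) &= 0 \quad \text{(first-order optimality at } x^*\text{)}, \\
L(x) - L(x^*) &\approx \tfrac{1}{2}(x - x^*)^{\top} H\,(x - x^*) \;\ge\; 0,
\qquad H = \nabla^{2} L(x^*) \succeq 0 \ \text{under convexity.}
\end{align}
```

Because the linear term vanishes at the minimiser, the first surviving term is quadratic in the deviation, and convexity makes it non-negative. How the paper maps this onto RHAE specifically is set out in its Appendix and is not reproduced here.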

If this is right

The public ARC-AGI-3 set fails to measure the exploration the benchmark claims to require.
Only the private 55-game evaluation functions as a genuine test of intelligent reasoning.
The EXPLORE-before-PLAN structure enables small models to outperform random and no-explore baselines on tasks that reward information gathering.
Performance depends on the interaction between model capability and the chosen exploration strategy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same vulnerability analysis could be applied to other interactive reasoning benchmarks that permit repeated actions or coordinate exploits.
Benchmarks would benefit from private evaluation sets released only after public ones are hardened against simple heuristics.
The speed-depth framework may predict performance drops in any environment where information gain trades off against action cost.

Load-bearing premise

The listed strategies such as single blind steps, repeated presses, and null-coordinate bypass do not qualify as intelligent exploration.

What would settle it

A demonstration that solving any of the 25 public games with the listed simple strategies requires task comprehension that transfers to unseen games rather than blind repetition.

Figures

Figures reproduced from arXiv: 2605.25931 by Liew Keong Han.

**Figure 3.** AERA pseudocode. LLM.explore returns structured HYPOTHESIS/UNCERTAIN/NEXT ACTION. The entropy proxy len(uncertain) <= theta gates the EXPLORE→VERIFY transition. How EXPLORE maps to BFS pre-solve. In the competition kernel, the EXPLORE phase is replaced by a breadth-first search over the game’s action space at episode start: the solver exhaustively tries action sequences up to depth d offline, caching all g…
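
The gate named in the Figure 3 caption can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the author's implementation: `Belief`, `stub_explore` and `run_explore_phase` are hypothetical names, and the stub stands in for `LLM.explore` by resolving one open question per step. Only the gating rule `len(uncertain) <= theta` is taken from the caption.

```python
from dataclasses import dataclass, field


@dataclass
class Belief:
    """Structured output assumed for one explore call."""
    hypothesis: str
    uncertain: list = field(default_factory=list)
    next_action: str = "ACTION1"


def stub_explore(step: int) -> Belief:
    """Hypothetical stand-in for LLM.explore: one open question resolves per step."""
    open_questions = ["effect of ACTION1", "effect of ACTION2", "goal condition"]
    remaining = open_questions[step + 1:]
    return Belief(hypothesis=f"h{step}", uncertain=remaining)


def run_explore_phase(theta: int, max_steps: int) -> tuple:
    """Return (phase reached, explore steps used).

    The agent leaves EXPLORE as soon as the entropy proxy
    len(uncertain) drops to theta or below.
    """
    for step in range(max_steps):
        belief = stub_explore(step)
        if len(belief.uncertain) <= theta:
            return "VERIFY", step + 1
    return "EXPLORE", max_steps  # budget exhausted before the gate opened


print(run_explore_phase(theta=1, max_steps=5))  # ('VERIFY', 2)
print(run_explore_phase(theta=0, max_steps=5))  # ('VERIFY', 3)
```

A larger `theta` opens the gate sooner and spends fewer actions on exploration; a smaller `theta` demands more certainty first. That is the speed-depth trade-off in miniature.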

**Figure 4.** Per-step belief entropy during the EXPLORE phase (EXP-002, Qwen2.5-0.5B). Star marks …

Original abstract

We systematically investigate all 25 public ARC-AGI-3 games and find that every one is reachable through non-intelligent strategies: 10 in a single blind step, 5 after one probing action, 1 via repeated ACTION1 presses, 1 via diverse exploration, and 8 via single repeated actions with sufficient budget (50-200 steps). A library-level null-coordinate vulnerability additionally bypasses 18 games in 1 step. This benchmark critique implies the public evaluation set cannot discriminate intelligent exploration from trivial heuristics - the private 55-game evaluation is the only genuine intelligence test. Against this backdrop, we present AERA (Adaptive Epistemic Reasoning Agent), a three-phase (EXPLORE / VERIFY / PLAN) agent achieving RHAE=0.2116 (4/25 solved) on these 25 games with Qwen2.5-0.5B, while random and no-explore baselines score 0.0000. We formalise AERA through a Speed--Depth trade-off framework: under a convexity assumption (proved for a class of environments in the Appendix), RHAE's quadratic form emerges as a second-order penalty for deviating from the Pareto frontier between action efficiency and information gain. Contributions: (i) a benchmark validity analysis showing that current interactive reasoning benchmarks fail to measure the exploration they claim to require, and (ii) the EXPLORE-before-PLAN framework and model-capability x exploration interaction. The linked code track entry achieves RHAE=0.30 on the full 55-game private evaluation. Code: CC0.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The public ARC-AGI-3 games look solvable by simple heuristics, but the paper's classification of those heuristics as non-intelligent lacks a clear decision rule.

The central claim is that all 25 public games have non-intelligent solutions: 10 by one blind step, 5 by one probe, 1 by repeated ACTION1 presses, 1 by diverse exploration, and 8 by a single repeated action over 50-200 steps, with a null-coordinate bypass that separately clears 18 games in one move. If the enumeration holds, the public set stops being a test of exploration. The paper also presents AERA, a three-phase agent that reaches RHAE 0.2116 on those games while random and no-explore baselines stay at zero, and it sketches a speed-depth trade-off that produces a quadratic penalty under convexity.

The enumeration itself is new and concrete. Checking every game and reporting the exact strategy counts plus the null bypass is useful work. The AERA design and the RHAE metric give a practical way to measure the explore-then-plan split, and the 0.30 RHAE that the linked code track entry reaches on the private set suggests the approach is not just tuned to the public games.

The soft spot is the missing definition of non-intelligent. The paper lists the strategies but supplies no information-theoretic cutoff, no comparison to an optimal policy, and no exclusion of systematic search. Without that, the claim that these are trivial rather than intelligent remains an assertion. The abstract gives no traces or error analysis, so the counts cannot be checked directly. The convexity assumption and the quadratic RHAE form sit downstream of the same classification, so they inherit the ambiguity.

This paper is for people who design or critique interactive reasoning benchmarks. Anyone working on ARC-style tasks or epistemic agents will want to see whether the public set really fails to discriminate. It deserves peer review so referees can examine the full traces, the appendix proof, and any formal criterion the authors can supply for the intelligent/non-intelligent line.

Referee Report

3 major / 1 minor

Summary. The paper systematically investigates all 25 public ARC-AGI-3 games and concludes that every one is reachable through non-intelligent strategies: 10 in a single blind step, 5 after one probing action, 1 via repeated ACTION1 presses, 1 via diverse exploration, and 8 via single repeated actions with sufficient budget (50-200 steps). A library-level null-coordinate vulnerability bypasses 18 games in 1 step. This implies the public evaluation set cannot discriminate intelligent exploration from trivial heuristics, with the private 55-game evaluation being the only genuine intelligence test. The authors present AERA (Adaptive Epistemic Reasoning Agent), a three-phase (EXPLORE / VERIFY / PLAN) agent achieving RHAE=0.2116 (4/25 solved) on these 25 games with Qwen2.5-0.5B, while random and no-explore baselines score 0.0000. They formalise AERA through a Speed--Depth trade-off framework: under a convexity assumption (proved for a class of environments in the Appendix), RHAE's quadratic form emerges as a second-order penalty for deviating from the Pareto frontier between action efficiency and information gain.

Significance. If the result holds after addressing verification issues, the paper's significance lies in its critique of benchmark validity for measuring exploration in AI agents, potentially influencing how future benchmarks are designed to better test intelligent strategies. The AERA framework and the model-capability x exploration interaction provide practical insights, and the achievement of RHAE=0.30 on the private 55-game evaluation with the code track entry is a notable strength demonstrating applicability beyond the public set. The formal Speed-Depth trade-off offers a theoretical contribution that could be built upon if the convexity assumption is clearly established.

major comments (3)

[Benchmark validity analysis (abstract and full text)] The central claim that the 25 public games cannot discriminate intelligent exploration from trivial heuristics relies on classifying the listed strategies as 'non-intelligent', but the manuscript provides no formal definition or decision procedure for 'intelligent exploration' (e.g., no information-theoretic threshold, no comparison to optimal policy). This makes the classification an untestable assertion and is load-bearing for the benchmark critique.
[Results on strategy counts and RHAE scores (abstract)] The abstract asserts the strategy counts and RHAE scores but supplies no verification data, game traces, or error analysis; this is a load-bearing issue for the soundness of the benchmark critique.
[Speed--Depth trade-off framework (formalisation section)] The quadratic RHAE form is stated to emerge from the Speed-Depth trade-off under the convexity assumption (proved for a class of environments in the Appendix); without the appendix proof details, it is unclear whether the form is independently derived or tied to fitted quantities, affecting the formal contribution.

minor comments (1)

[Abstract] The acronym RHAE is used without initial expansion in the provided abstract, which may confuse readers unfamiliar with the term.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments identify key areas where additional clarity will strengthen the manuscript. We will revise to provide a formal definition of intelligent exploration, include verification data and traces, and expand the appendix derivation. These changes preserve the core claims while improving verifiability and formal precision.

Referee: [Benchmark validity analysis (abstract and full text)] The central claim that the 25 public games cannot discriminate intelligent exploration from trivial heuristics relies on classifying the listed strategies as 'non-intelligent', but the manuscript provides no formal definition or decision procedure for 'intelligent exploration' (e.g., no information-theoretic threshold, no comparison to optimal policy). This makes the classification an untestable assertion and is load-bearing for the benchmark critique.

Authors: We agree a formal definition strengthens the argument. In revision we will define non-intelligent strategies as those implementable by a fixed policy with no environmental model or use of observations (e.g., constant action repetition or single blind step). Intelligent exploration is defined as any policy that conditions actions on gathered observations to increase expected task progress. This supplies an explicit decision procedure based on whether the strategy requires epistemic updating. The empirical classification of the 25 games remains unchanged under this definition. revision: yes
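
The decision procedure proposed in this response (a strategy is non-intelligent if it never conditions on observations) could be probed empirically along the following lines. This is a hedged sketch: the policy functions, the observation labels `"changed"` and `"same"`, and `is_open_loop` are hypothetical and not part of the paper or the ARC-AGI-3 toolkit. A finite check like this can show that a policy depends on observations, but it cannot prove that one never does.

```python
from typing import Callable, List, Sequence

# A policy maps the observation history so far to the next action.
Policy = Callable[[Sequence[str]], str]


def repeat_action1(history: Sequence[str]) -> str:
    """Fixed policy: ignores observations entirely."""
    return "ACTION1"


def probe_then_react(history: Sequence[str]) -> str:
    """Observation-conditioned policy: switches action after seeing a change."""
    if not history:
        return "ACTION1"
    return "ACTION2" if history[-1] == "changed" else "ACTION1"


def rollout(policy: Policy, observations: Sequence[str]) -> List[str]:
    """Feed a fixed observation stream to the policy and record its actions."""
    history: List[str] = []
    actions: List[str] = []
    for obs in observations:
        actions.append(policy(history))
        history.append(obs)
    return actions


def is_open_loop(policy: Policy, streams: Sequence[Sequence[str]]) -> bool:
    """True if the action sequence is identical across all supplied streams."""
    rollouts = [rollout(policy, stream) for stream in streams]
    return all(r == rollouts[0] for r in rollouts)


streams = [["changed", "changed", "same"], ["same", "same", "same"]]
print(is_open_loop(repeat_action1, streams))    # True
print(is_open_loop(probe_then_react, streams))  # False
```

Under this test, constant repetition is classified as open-loop, while any strategy whose later actions depend on what an earlier action revealed is not.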

Referee: [Results on strategy counts and RHAE scores (abstract)] The abstract asserts the strategy counts and RHAE scores but supplies no verification data, game traces, or error analysis; this is a load-bearing issue for the soundness of the benchmark critique.

Authors: The counts were obtained by exhaustive per-game inspection using the released code. We will add an appendix containing: (i) the strategy label assigned to each of the 25 games with the justifying trace or action sequence, (ii) representative execution traces for each category, and (iii) a short error analysis confirming that re-running the classification procedure yields identical counts. The RHAE values are computed directly from the agent logs already linked in the code repository. revision: yes

Referee: [Speed--Depth trade-off framework (formalisation section)] The quadratic RHAE form is stated to emerge from the Speed-Depth trade-off under the convexity assumption (proved for a class of environments in the Appendix); without the appendix proof details, it is unclear whether the form is independently derived or tied to fitted quantities, affecting the formal contribution.

Authors: The appendix already contains the proof that convexity implies the quadratic penalty term for deviation from the Pareto frontier. To address the concern we will (a) move the key derivation steps into the main formalisation section and (b) expand the appendix with the full inductive argument and an explicit statement that the quadratic form follows from the convexity assumption alone, without reference to any fitted parameters. This clarifies the independent mathematical origin of the result. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical enumeration and appendix-supported derivation are independent of inputs

The manuscript's core claims rest on direct enumeration of the strategies that solve each of the 25 public games (single blind step, probing action, repeated ACTION1, etc.) plus a performance comparison of AERA against random and no-explore baselines. The Speed-Depth formalization states that RHAE's quadratic form emerges under a convexity assumption proved in the Appendix for a class of environments; this is presented as a derived second-order penalty rather than a fit or self-definition. No self-citations, fitted parameters renamed as predictions, or ansatz smuggling are present. The benchmark critique is grounded in observable game mechanics, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract introduces one explicit assumption required for the theoretical claim.

axioms (1)

domain assumption: convexity assumption for a class of environments
Invoked to establish that RHAE takes a quadratic form as a second-order penalty away from the action-efficiency versus information-gain Pareto frontier.

Reference graph

Works this paper leans on

22 extracted references · 9 canonical work pages · 7 internal anchors

[1] ARC Prize Foundation. (2026). ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence. arXiv:2603.24621
[2] Chollet, F. (2019). On the Measure of Intelligence. arXiv:1911.01547
[3] Kaelbling, L.P., Littman, M.L., & Cassandra, A.R. (1998). Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101(1-2), 99–134
[4] MacKay, D.J.C. (1992). Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4), 590–604
[5] Settles, B. (2010). Active Learning Literature Survey. Univ. Wisconsin–Madison TR 1648
[6] Spaan, M.T.J. (2012). Partially Observable Markov Decision Processes. Reinforcement Learning: State of the Art, 387–414
[7] Kadavath, S. et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221
[8] Bramley, N.R., Dayan, P., Griffiths, T.L., & Lagnado, D.A. (2017). Formalizing Neurath’s Ship. Psychological Review, 124(3), 301–338
[9] Rule, J.S., Tenenbaum, J.B., & Piantadosi, S.T. (2020). The Child as Hacker. Trends in Cognitive Sciences, 24(11), 900–915
[10] Tenenbaum, J.B. & Griffiths, T.L. (2001). Generalization, Similarity, and Bayesian Inference. Behavioral and Brain Sciences, 24(4), 629–640
[11] Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138
[12] Lake, B.M., Salakhutdinov, R., & Tenenbaum, J.B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332–1338
[13] Jolicoeur-Martineau, A. (2025). Less is More: Recursive Reasoning with Tiny Networks. arXiv:2510.04871. ARC Prize 2025 Paper Award, 1st place
[14] Pourcel, J. et al. (2025). Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI. ARC Prize 2025 Paper Award, 2nd place
[15] Liao, I. et al. (2025). CompressARC: An MDL-Based Single-Puzzle Neural System for ARC. ARC Prize 2025 Paper Award, 3rd place
[16] Chollet, F., Knoop, M., Kamradt, G., & Landers, B. (2026). ARC Prize 2025: Technical Report. arXiv:2601.10904
[17] Yao, S. et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629
[18] Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903
[19] Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023. arXiv:2305.10601
[20] Ellis, K. et al. (2021). DreamCoder: Bootstrapping Inductive Program Synthesis with Wake-Sleep Library Learning. PLDI 2021. arXiv:2006.08381
[21] Tolman, E.C. (1948). Cognitive maps in rats and men. Psychological Review, 55(4), 189–208
[22] O’Keefe, J. & Nadel, L. (1978). The Hippocampus as a Cognitive Map. Oxford University Press
