Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2

Johan Land

arxiv: 2606.31543 · v1 · pith:FCO5YZ3Qnew · submitted 2026-06-30 · 💻 cs.AI · cs.CL· cs.LG

Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2

Johan Land This is my paper

Pith reviewed 2026-07-01 05:30 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG

keywords ARC-AGIreasoning tracesmultimodal searchholistic judgingabstract reasoningLLM solverminority hypothesissearch operators

0 comments

The pith

Generating reasoning traces independently across text, image and code then judging them together in one prompt solves ARC-AGI-2 at 72.9 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often generate fluent but incorrect reasoning traces for abstract visual tasks, so the bottleneck is selecting the right hypothesis rather than producing more output. The paper treats text, image, and code as separate search operators that produce candidates in parallel, then feeds every trace into a single long-context judge prompt that compares them jointly. This holistic approach recovers correct answers even when they are outnumbered by wrong ones, unlike majority voting. On the ARC Prize semi-private set the method reaches 72.9 percent accuracy at roughly 39 dollars per task, exceeding the best standalone frontier models by 18.7 points. The work also reports that standard techniques such as prescriptive templates and iterative refinement reduce diversity and lower scores.

Core claim

By treating reasoning modalities as independent search operators that generate diverse candidates across text, image, and code channels and then jointly comparing all traces inside a single long-context prompt, the solver reliably identifies correct minority hypotheses on ARC-AGI-2 tasks where the modal answer is wrong.

What carries the argument

Modality-driven search that produces independent candidates across channels combined with context-preserving holistic judging in one prompt.

If this is right

Correct solutions can be recovered even when they form a minority of generated traces.
The method exceeds the accuracy of the strongest standalone frontier models by 18.7 percentage points on the semi-private set.
Prescriptive prompting templates and iterative refinement reduce hypothesis diversity and degrade final performance.
Full source code is released along with documentation of the negative results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of generation channels and joint judging could be tested on other few-shot reasoning benchmarks where diversity is known to matter.
Increasing the number of candidates per modality while keeping the judge prompt fixed might raise accuracy until context limits are reached.
The reported cost per task suggests that selective use of the holistic judge only on high-uncertainty tasks could lower average expense.

Load-bearing premise

A single long-context judge prompt can reliably pick the correct minority hypothesis even when most traces are wrong, without its own ordering or length biases.

What would settle it

Measure accuracy on the same tasks when the holistic judge is replaced by majority vote over the same set of traces and check whether the gap appears mainly on tasks where the most common answer is incorrect.

Figures

Figures reproduced from arXiv: 2606.31543 by Johan Land.

**Figure 1.** Figure 1: ARC-AGI-2 task 3dc255db. A human might see “spaceships” with particles on the exhaust side. The transformation removes the particles from the interior and extends them from the nose. Three training pairs (rows 1–3) demonstrate the rule; the test input (row 4) must be solved from these examples alone. This task remains unsolved. 4 [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 2.** Figure 2: Solver pipeline: candidate generation with adaptive early stopping, holistic judging, and [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Example image rendering used for image-based prompting (task d35bdbdc:1). [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: Methodology matrix over public evaluation instances. Green = candidate matches ground [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗

read the original abstract

Large language models can produce fluent, internally coherent reasoning traces for abstract reasoning tasks while still being confidently wrong - making selection among candidates, not just generation, the central challenge. I present a solver for ARC-AGI-2, a few-shot visual reasoning benchmark, built around two principles: (i) treating reasoning modalities as search operators, generating diverse candidates independently across text, image, and code channels, and (ii) context-preserving holistic judging, in which a judge model jointly compares all candidate reasoning traces within a single long-context prompt. Unlike self-consistency or majority voting, this approach reliably recovers correct minority hypotheses on tasks where the modal answer is wrong. On the ARC Prize semi-private evaluation set, the solver achieves 72.9 percent at USD 38.99 per task - the highest score on the verified leaderboard at the time of writing, exceeding the best standalone frontier models, GPT-5.2 Pro at 54.2 percent and Gemini 3 Pro at 54.0 percent, by +18.7 percentage points. On the public evaluation set, it achieves 76.1 percent at USD 19.69 per task. I release the full source code and document extensive negative results, including the finding that prescriptive prompting templates and iterative refinement systematically reduce hypothesis diversity and degrade performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gets a big reported lift on ARC-AGI-2 by generating traces across text/image/code then using one long-context judge to pick among them, with code released, but the writeup gives almost no controls or stats to back the judge's reliability.

read the letter

The paper describes a solver for ARC-AGI-2 that generates candidate reasoning traces separately in text, image, and code modalities, then feeds all of them into one long-context prompt for a judge model to select the best one. It reports 72.9 percent on the semi-private set, beating the top frontier models by nearly 19 points, and 76.1 percent on the public set, with code released and some negative results documented.

The new part is the explicit framing of modalities as independent search operators combined with holistic judging in a single prompt, rather than majority vote or iterative refinement. Releasing the code and showing that prescriptive templates reduce diversity is useful for others who want to build on or avoid those pitfalls.

The main weakness is the absence of any reported details on statistical significance, multiple runs, task sampling, or tests for judge-specific artifacts such as position bias or recency effects in the long context. The performance jump depends on the judge correctly identifying minority-correct traces, yet the abstract provides no evidence that this holds after controlling for known long-context issues. If those controls are missing in the full paper as well, the +18 point gain could be overstated.

This work is for researchers and engineers focused on improving performance on abstract reasoning benchmarks like ARC. Someone looking for practical implementation ideas or documented failures will find it worth reading. It should go to peer review because the concrete leaderboard numbers and open code make it possible for referees to check the claims directly, even if the current presentation leaves the judge mechanism under-examined.

Referee Report

3 major / 2 minor

Summary. The paper introduces a solver for the ARC-AGI-2 benchmark that generates candidate reasoning traces independently across text, image, and code modalities and then uses a single long-context holistic judge prompt to select among them. It claims this approach recovers correct minority hypotheses more reliably than majority voting or self-consistency, achieving 72.9% on the ARC Prize semi-private evaluation set (at $38.99 per task) and 76.1% on the public set (at $19.69 per task), outperforming standalone frontier models by +18.7 pp. The manuscript releases full source code and documents negative results on prescriptive templates and iterative refinement.

Significance. If the reported performance holds under rigorous verification, the work would demonstrate that explicit multi-modality search combined with joint long-context judging can produce substantial gains on abstract visual reasoning tasks where single-model generation fails. The public release of code and the documentation of negative results on common prompting practices are clear strengths that support reproducibility and community progress.

major comments (3)

[Abstract and §4 (Evaluation)] Abstract and §4 (Evaluation): The central performance claims of 72.9% (semi-private) and 76.1% (public) are presented without reported task counts, sampling protocol, statistical significance tests, or error bars. Because the +18.7 pp margin over GPT-5.2 Pro and Gemini 3 Pro is the primary empirical result, the absence of these details prevents assessment of whether the improvement is robust or could be explained by evaluation variance.
[§3.2 (Holistic Trace Judging)] §3.2 (Holistic Trace Judging): The claim that a single long-context judge reliably recovers correct minority hypotheses rests on the assumption that the judge is free of systematic artifacts. No ablation studies, position-permutation controls, or recency-bias measurements are described, leaving open the possibility that observed gains are partly attributable to judge ordering effects rather than the modality-search principle.
[§4.3 (Cost and Leaderboard Comparison)] §4.3 (Cost and Leaderboard Comparison): The per-task costs ($38.99 and $19.69) and leaderboard ranking are load-bearing for the practical contribution, yet no breakdown of token usage by modality or judge calls is supplied, nor is there verification that the semi-private score was obtained under the official ARC Prize protocol.

minor comments (2)

[Figure 2 and §3.1] Figure 2 and §3.1: The description of how image and code modalities are rendered into the shared context would benefit from an explicit example showing the exact prompt template used for each modality.
[§5 (Negative Results)] §5 (Negative Results): The finding that prescriptive templates reduce diversity is valuable but would be strengthened by quantitative metrics (e.g., entropy of generated answers) rather than qualitative description alone.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments emphasizing empirical rigor, controls for artifacts, and reproducibility. We respond to each major comment below and outline planned revisions.

read point-by-point responses

Referee: Abstract and §4 (Evaluation): The central performance claims of 72.9% (semi-private) and 76.1% (public) are presented without reported task counts, sampling protocol, statistical significance tests, or error bars. Because the +18.7 pp margin over GPT-5.2 Pro and Gemini 3 Pro is the primary empirical result, the absence of these details prevents assessment of whether the improvement is robust or could be explained by evaluation variance.

Authors: We agree that these details are required to evaluate robustness. The revised manuscript will explicitly report the task counts for the semi-private and public sets, provide a complete description of the evaluation protocol (including the fixed few-shot setup with no stochastic sampling in the solver), and include error bars derived from bootstrap methods along with statistical comparisons against the baseline models where appropriate. revision: yes
Referee: §3.2 (Holistic Trace Judging): The claim that a single long-context judge reliably recovers correct minority hypotheses rests on the assumption that the judge is free of systematic artifacts. No ablation studies, position-permutation controls, or recency-bias measurements are described, leaving open the possibility that observed gains are partly attributable to judge ordering effects rather than the modality-search principle.

Authors: This concern about potential judge artifacts is valid. The original manuscript contains no position-permutation ablations or recency-bias measurements. We will revise §3.2 to acknowledge this limitation explicitly, discuss the assumption of judge neutrality, and note that the primary source of performance gains is the independent modality-driven generation of diverse hypotheses. The public code release enables community verification of these effects. revision: partial
Referee: §4.3 (Cost and Leaderboard Comparison): The per-task costs ($38.99 and $19.69) and leaderboard ranking are load-bearing for the practical contribution, yet no breakdown of token usage by modality or judge calls is supplied, nor is there verification that the semi-private score was obtained under the official ARC Prize protocol.

Authors: We will expand §4.3 to include a breakdown of token usage separated by modality generation and judge calls. The semi-private score was obtained by following the official ARC Prize submission and verification protocol, as confirmed by its placement on the verified leaderboard; we will add an explicit statement to this effect in the revised text. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical system evaluated on external benchmarks

full rationale

The paper presents an empirical engineering solver for ARC-AGI-2 relying on modality-driven candidate generation and long-context holistic judging. All performance claims (72.9% semi-private, 76.1% public) are grounded in external benchmark evaluation against independent models (GPT-5.2 Pro, Gemini 3 Pro) rather than any internal derivation, equations, fitted parameters, or self-referential predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes appear; the work is self-contained against the ARC benchmark and reports negative results on alternative approaches.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are stated. The method relies on the empirical behavior of existing frontier models as black-box generators and judges.

pith-pipeline@v0.9.1-grok · 5761 in / 1295 out tokens · 27799 ms · 2026-07-01T05:30:48.103887+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 6 canonical work pages · 1 internal anchor

[1]

LLM-as-a-judge

o introduce a key trade-off: iterative refinement can increase compute while riskinganchoringto early hypotheses—an issue that becomes acute on ARC tasks where early commitments can be misleading. 2.8 Selection, verification, and “LLM-as-a-judge” paradigms Generating diverse candidates is only half the problem; the other half is selecting among them. In t...

2023
[2]

the most common solution

used not just for execution but also as a way to produce richer intermediate artifacts that can be consumed by downstream selection. 3.Judge-based selection[LLM-as-a-judge; Zheng et al. (2023)] adapted from general LLM evaluation into an in-task meta-reasoning component, with the additional complication that ARC demands judginggeneralization under undersp...

2023
[3]

No debiasing is applied in the current implementation

and could interact with the modality mix, since code candidates tend to produce structured traces while text candidates produce prose. No debiasing is applied in the current implementation. 5.7 Feasibility and context length The holistic judging prompt is intentionally large: on the order of30k–80k input tokens, because it includes full traces from many c...

2024
[4]

The leaderboard snapshot in Table 2 reflects the state at the time of announcement; subsequent entries (discussed below) have since been added. 6.3.2 Semi-private evaluation (ARC Prize Verified / leaderboard) My solver achieves: •72.9% solvedon ARC-AGI-2 semi-private eval (≈73%) at$38.99/task— the highest score on the ARC Prize Verified leaderboard at the...

2024
[5]

Leaderboard snapshot and reference systems.9 Semi-private results are as reported on the ARC Prize Verified leaderboard at the time of the official results announcement (February 3, 2026); the public-evaluation row is self-measured on the public evaluation split. 9https://arcprize.org/leaderboard 18 AI System Author ARC-AGI-2 Cost/Task Comment Human Panel...

2026
[6]

diverse generation + holistic judging

have already narrowed the gap between single-model baselines and ensemble approaches like the one described here, at a fraction of the cost. This trajectory is expected to continue. As base models grow stronger, the marginal value of any fixed ensemble architecture will shift: the same “diverse generation + holistic judging” pattern applied to stronger ba...

2024
[7]

Phase Total ($) Avg $/instance % of total Candidate generation 2081.37 12.46 87.1% Judging 308.91 1.85 12.9% Total 2390.28 14.31 100% Table

Cost attribution per test instance on the public evaluation run (n = 167). Phase Total ($) Avg $/instance % of total Candidate generation 2081.37 12.46 87.1% Judging 308.91 1.85 12.9% Total 2390.28 14.31 100% Table

2081
[8]

Step 1: Identify the pattern. Step 2: Describe the rule. Step 3: Apply to test input

and Reflexion (Shinn et al. 2023), where an initial pass generates feedback that informs a subsequent attempt. Motivation: - doubling the reasoning budget across two turns (hint then solve). Observed drawback: - the hint stage oftenlimits creativityand collapses candidate diversity into a narrower space, which is counterproductive when trying to break new...

2023
[9]

reasoning settings

that provides only the task data, a brief context sentence, and a request to explain reasoning — with no prescribed structure, no step-by-step template, and no domain heuristics. The mechanism appears to be acompliance tax on reasoning: when the model is given detailed instructions abouthowto think, it allocates a significant portion of its reasoning budg...

2024
[10]

On the Measure of Intelligence

“On the Measure of Intelligence.”arXiv Preprint arXiv:1911.01547. Chollet, François, Mike Knoop, Gregory Kamradt, and Bryan Landers

work page internal anchor Pith review Pith/arXiv arXiv 1911
[11]

Chollet, M

“ARC Prize 2024: Technical Report.”arXiv Preprint arXiv:2412.04604. Ferré, Sébastien

work page arXiv 2024
[12]

First Steps of an Approach to the ARC Challenge Based on Descriptive GridModelsandtheMinimumDescriptionLengthPrinciple

“First Steps of an Approach to the ARC Challenge Based on Descriptive GridModelsandtheMinimumDescriptionLengthPrinciple.”arXiv Preprint arXiv:2112.00848. Fletcher-Hill, Paul

work page arXiv
[13]

arXiv preprint arXiv:2404.07353 , year=

“Addressing the Abstraction and Reasoning Corpus via Procedural Example Generation.”arXiv Preprint arXiv:2404.07353. 36 Li, Wen-Ding, Keya Hu, Carter Larsen, Yuqing Wu, Simon Alford, Caleb Woo, Spencer M. Dunn, et al

work page arXiv
[14]

ARC-GEN: A Mimetic Procedural Benchmark Generator for the Ab- straction and Reasoning Corpus

“ARC-GEN: A Mimetic Procedural Benchmark Generator for the Ab- straction and Reasoning Corpus.”arXiv Preprint arXiv:2511.00162. Moskvichev, Arseny, Victor Vikram Odouard, and Melanie Mitchell

work page arXiv
[15]

Towards Efficient Neurally-Guided Program Induction for ARC-AGI

“Towards Efficient Neurally-Guided Program Induction for ARC-AGI.” arXiv Preprint arXiv:2411.17708. Puget, Jean-François

work page arXiv

[1] [1]

LLM-as-a-judge

o introduce a key trade-off: iterative refinement can increase compute while riskinganchoringto early hypotheses—an issue that becomes acute on ARC tasks where early commitments can be misleading. 2.8 Selection, verification, and “LLM-as-a-judge” paradigms Generating diverse candidates is only half the problem; the other half is selecting among them. In t...

2023

[2] [2]

the most common solution

used not just for execution but also as a way to produce richer intermediate artifacts that can be consumed by downstream selection. 3.Judge-based selection[LLM-as-a-judge; Zheng et al. (2023)] adapted from general LLM evaluation into an in-task meta-reasoning component, with the additional complication that ARC demands judginggeneralization under undersp...

2023

[3] [3]

No debiasing is applied in the current implementation

and could interact with the modality mix, since code candidates tend to produce structured traces while text candidates produce prose. No debiasing is applied in the current implementation. 5.7 Feasibility and context length The holistic judging prompt is intentionally large: on the order of30k–80k input tokens, because it includes full traces from many c...

2024

[4] [4]

The leaderboard snapshot in Table 2 reflects the state at the time of announcement; subsequent entries (discussed below) have since been added. 6.3.2 Semi-private evaluation (ARC Prize Verified / leaderboard) My solver achieves: •72.9% solvedon ARC-AGI-2 semi-private eval (≈73%) at$38.99/task— the highest score on the ARC Prize Verified leaderboard at the...

2024

[5] [5]

Leaderboard snapshot and reference systems.9 Semi-private results are as reported on the ARC Prize Verified leaderboard at the time of the official results announcement (February 3, 2026); the public-evaluation row is self-measured on the public evaluation split. 9https://arcprize.org/leaderboard 18 AI System Author ARC-AGI-2 Cost/Task Comment Human Panel...

2026

[6] [6]

diverse generation + holistic judging

have already narrowed the gap between single-model baselines and ensemble approaches like the one described here, at a fraction of the cost. This trajectory is expected to continue. As base models grow stronger, the marginal value of any fixed ensemble architecture will shift: the same “diverse generation + holistic judging” pattern applied to stronger ba...

2024

[7] [7]

Phase Total ($) Avg $/instance % of total Candidate generation 2081.37 12.46 87.1% Judging 308.91 1.85 12.9% Total 2390.28 14.31 100% Table

Cost attribution per test instance on the public evaluation run (n = 167). Phase Total ($) Avg $/instance % of total Candidate generation 2081.37 12.46 87.1% Judging 308.91 1.85 12.9% Total 2390.28 14.31 100% Table

2081

[8] [8]

Step 1: Identify the pattern. Step 2: Describe the rule. Step 3: Apply to test input

and Reflexion (Shinn et al. 2023), where an initial pass generates feedback that informs a subsequent attempt. Motivation: - doubling the reasoning budget across two turns (hint then solve). Observed drawback: - the hint stage oftenlimits creativityand collapses candidate diversity into a narrower space, which is counterproductive when trying to break new...

2023

[9] [9]

reasoning settings

that provides only the task data, a brief context sentence, and a request to explain reasoning — with no prescribed structure, no step-by-step template, and no domain heuristics. The mechanism appears to be acompliance tax on reasoning: when the model is given detailed instructions abouthowto think, it allocates a significant portion of its reasoning budg...

2024

[10] [10]

On the Measure of Intelligence

“On the Measure of Intelligence.”arXiv Preprint arXiv:1911.01547. Chollet, François, Mike Knoop, Gregory Kamradt, and Bryan Landers

work page internal anchor Pith review Pith/arXiv arXiv 1911

[11] [11]

Chollet, M

“ARC Prize 2024: Technical Report.”arXiv Preprint arXiv:2412.04604. Ferré, Sébastien

work page arXiv 2024

[12] [12]

First Steps of an Approach to the ARC Challenge Based on Descriptive GridModelsandtheMinimumDescriptionLengthPrinciple

“First Steps of an Approach to the ARC Challenge Based on Descriptive GridModelsandtheMinimumDescriptionLengthPrinciple.”arXiv Preprint arXiv:2112.00848. Fletcher-Hill, Paul

work page arXiv

[13] [13]

arXiv preprint arXiv:2404.07353 , year=

“Addressing the Abstraction and Reasoning Corpus via Procedural Example Generation.”arXiv Preprint arXiv:2404.07353. 36 Li, Wen-Ding, Keya Hu, Carter Larsen, Yuqing Wu, Simon Alford, Caleb Woo, Spencer M. Dunn, et al

work page arXiv

[14] [14]

ARC-GEN: A Mimetic Procedural Benchmark Generator for the Ab- straction and Reasoning Corpus

“ARC-GEN: A Mimetic Procedural Benchmark Generator for the Ab- straction and Reasoning Corpus.”arXiv Preprint arXiv:2511.00162. Moskvichev, Arseny, Victor Vikram Odouard, and Melanie Mitchell

work page arXiv

[15] [15]

Towards Efficient Neurally-Guided Program Induction for ARC-AGI

“Towards Efficient Neurally-Guided Program Induction for ARC-AGI.” arXiv Preprint arXiv:2411.17708. Puget, Jean-François

work page arXiv