ARC Prize 2024: Technical Report

ARC Prize · 2024 · arXiv 2412.04604

Canonical reference. 100% of citing Pith papers cite this work as background.

25 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

cs.AI · 2026-06-23 · unverdicted · novelty 7.0

TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.

Knowledge Index of Noah's Ark

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Introduces the KINA benchmark, with 899 items across 261 disciplines, a formal (1-1/e) coverage guarantee, and a bonus-on-bar tournament theorem, plus evaluations of 42 models with a top score of 53.17%.

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.

Factorization Regret mediates compositional generalization in latent space

cs.LG · 2026-03-28 · unverdicted · novelty 7.0

Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.

Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2

cs.AI · 2026-06-30 · unverdicted · novelty 6.0

A modality-driven search system with holistic trace judging for ARC-AGI-2 reaches 72.9% on the semi-private set and 76.1% on the public set, outperforming GPT-5.2 Pro and Gemini 3 Pro by 18.7 points while releasing full code.

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

cs.AI · 2026-06-16 · unverdicted · novelty 6.0

FPRM is a Transformer-based model using fixed-point convergence for adaptive halting in looped architectures, claimed effective on Sudoku, Maze, state-tracking, and ARC-AGI benchmarks.

Slots, Transitions, Loops: Learning Composable World Models for ARC

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

Loop-OWM uses color-prototype slots, demonstration-conditioned task summaries, and looped transitions to model ARC rules as visual-symbolic state changes and outperforms baselines on ARC-1 and ARC-2.

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

cs.AI · 2026-05-31 · conditional · novelty 6.0

LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal patching.

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

cs.AI · 2026-05-11 · unverdicted · novelty 6.0 · 3 refs

Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.

One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

cs.AI · 2025-10-09 · unverdicted · novelty 6.0

Introduces a group matching score for better evaluation of compositional reasoning and the Test-Time Matching (TTM) algorithm for unsupervised self-improvement in multimodal models, achieving SOTA gains that include surpassing GPT-4.1 and estimated human performance.

Artificial Phantasia: Emergent Mental Imagery in Large Language Models

cs.AI · 2025-09-27 · unverdicted · novelty 6.0

LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

cs.AI · 2025-05-17 · unverdicted · novelty 6.0

ARC-AGI-2 adds a larger, more complex set of tasks to the original ARC-AGI benchmark to give finer-grained measurement of fluid intelligence in AI.

Language-Guided Abstraction for Visual Reasoning

cs.CV · 2026-06-11 · unverdicted · novelty 5.0

L-VARC is a LUPI framework that refines crowd-sourced language descriptions with an LLM and uses cross-attention to guide visual ARC models during training only, yielding SOTA results with a lightweight 18M-parameter network.

Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

cs.AI · 2026-05-11 · unverdicted · novelty 5.0

The authors propose creating data probes (synthetic sequences from defined random processes) to reveal how data properties drive LLM behavior across workflow stages.

Beyond Tools and Persons: Who Are They? Classifying Robots and AI Agents for Proportional Governance

cs.ET · 2026-04-07 · unverdicted · novelty 5.0

A CPST-based taxonomy sorts autonomous systems into Confined Actors, Socially-Aware Interactors, and CPST-Integrated Agents to enable proportional governance from enhanced liability to qualified personhood.

Hierarchical Reasoning Model

cs.AI · 2025-06-26 · unverdicted · novelty 5.0

HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples without pre-training or CoT supervision.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

Humanity's Last Exam

cs.LG · 2025-01-24 · unverdicted · novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

A Compositional Framework for Open-ended Intelligence

cs.LG · 2026-06-13 · unverdicted · novelty 4.0

Open-ended intelligence is formalized as the compositional closure L(P,C) of primitives P under operators C, with next primitive prediction proposed as an objective to acquire reusable primitives and grammar for lifelong adaptation.

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

cs.CL · 2026-06-13 · unverdicted · novelty 4.0

Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.

Customizing an LLM for Enterprise Software Engineering

cs.SE · 2026-05-15 · unverdicted · novelty 4.0 · 2 refs

Gemini for Google, customized via continued pre-training on proprietary Google engineering data, delivers measurable productivity gains in a large internal developer study.

Measuring AI Reasoning: A Guide for Researchers

cs.AI · 2026-05-04 · unverdicted · novelty 4.0

Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

citing papers explorer

Showing 2 of 2 citing papers after filters.

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

cs.AI · 2026-05-11 · unverdicted · none · ref 12 · 3 links

Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · none · ref 135

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
