hub Canonical reference

Learning to Discover at Test Time

Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang · 2026 · cs.LG · arXiv 2601.16175

Canonical reference. 100% of citing Pith papers cite this work as background.

37 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 37 citing papers arXiv PDF

abstract

How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erd\H{o}s' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10

citation-polarity summary

background 10

representative citing papers

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

cs.CL · 2026-05-08 · conditional · novelty 8.0 · 2 refs

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

cs.CL · 2026-06-23 · unverdicted · novelty 7.0

NatureBench evaluates ten frontier AI coding agents on 90 tasks from Nature papers under web-search-disabled conditions and finds the strongest agent surpasses published SOTA on only 17.8% of tasks, succeeding mainly by translating problems into familiar supervised learning setups.

SPIRAL: Learning to Search and Aggregate

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

SPIRAL is a reinforcement learning framework that jointly optimizes sequential reasoning, parallel trace generation, and aggregation in language models for improved test-time performance.

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

cs.LG · 2026-06-13 · unverdicted · novelty 7.0

StarOR couples MCTS with GRPO-based test-time RL and unsupervised rewards to adapt optimization modeling policies instance-specifically, reporting SOTA results on five benchmarks with a 4B model.

The Power of Test-Time Training for Approximate Sampling

cs.DS · 2026-06-09 · unverdicted · novelty 7.0

Establishes a quadratic lower bound on query complexity for sampling from large classes of distributions given approximate density oracles, answers an open question on optimality of random walks, and shows circumvention for bounded classes as an abstraction of TTT.

Alpha-RTL: Test-Time Training for RTL Hardware Optimization

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.

What Do Evolutionary Coding Agents Evolve?

cs.NE · 2026-05-19 · unverdicted · novelty 7.0

Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

cs.LG · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CODA re-expresses most non-attention Transformer computations as GEMM-plus-epilogue programs using a constrained set of composable primitives to keep intermediate results on-chip and cut global memory traffic.

Test-Time Learning with an Evolving Library

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.

Harnessing Agentic Evolution

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.

Agentic-imodels: Evolving agentic interpretability tools via autoresearch

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

New Bounds for Zarankiewicz Numbers via Reinforced LLM Evolutionary Search

cs.AI · 2026-05-01 · accept · novelty 7.0

LLM-reinforced evolutionary search produces exact values Z(11,21,3,3)=116, Z(11,22,3,3)=121, Z(12,22,3,3)=132 and lower bounds for 41 additional Zarankiewicz numbers.

Meta-Harness: End-to-End Optimization of Model Harnesses

cs.AI · 2026-03-30 · unverdicted · novelty 7.0

Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across held-out models.

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

cs.CL · 2026-06-27 · unverdicted · novelty 6.0

Evolution Fine-Tuning trains LLMs on 156K trajectories spanning 371 tasks to achieve 10.22% average improvement on 22 held-out optimization tasks and match SOTA on select circle-packing problems when combined with test-time RL.

Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

cs.LG · 2026-06-25 · unverdicted · novelty 6.0

RiVER applies calibrated ranking rewards from execution scores to train LLMs on score-based tasks without ground-truth, producing gains on both heuristic contests and exact-solution coding benchmarks.

Learning the ARTS of Search for Automated Discovery

cs.AI · 2026-06-20 · unverdicted · novelty 6.0

ARTS improves automated scientific discovery by using reasoning LMs with test-time training to separate hypothesis merit from execution quality in tree search, achieving 15.3% relative gains on 22 MLGym and MLEBench tasks.

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

cs.LG · 2026-06-17 · unverdicted · novelty 6.0

REVES augments LLM post-training by decoupling revision and verification signals from successful multi-step trajectories, reporting +6.5 point gains on LiveCodeBench over RL baselines.

Large Language Models Hack Rewards, and Society

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

LLMs discover regulatory loopholes in simulated societal environments through reward hacking during RL training.

SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

SAGE compares social co-evolution against matched self-evolution across three arenas and finds peer history enables breakthroughs only for agents that plateau under self-improvement, with abstraction of traces mattering more than raw volume.

Test Time Training for Supervised Causal Learning

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

TTT-SCL dynamically generates test-aligned training sets for supervised causal learning using score-based functions and outperforms prior SCL and traditional causal discovery methods on benchmarks and real data.

Self-Improving Language Models with Bidirectional Evolutionary Search

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

Bidirectional Evolutionary Search augments autoregressive expansion with evolutionary recombination operators and dense backward subgoal feedback to produce better candidates than standard best-of-N or tree search for language model self-improvement.

MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-2K dataset.

Epistemic Uncertainty for Test-Time Discovery

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.

citing papers explorer

Showing 13 of 13 citing papers after filters.

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling cs.LG · 2026-06-13 · unverdicted · none · ref 2 · internal anchor
StarOR couples MCTS with GRPO-based test-time RL and unsupervised rewards to adapt optimization modeling policies instance-specifically, reporting SOTA results on five benchmarks with a 4B model.
Alpha-RTL: Test-Time Training for RTL Hardware Optimization cs.LG · 2026-06-03 · unverdicted · none · ref 1 · internal anchor
TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.
CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs cs.LG · 2026-05-19 · unverdicted · none · ref 24 · 2 links · internal anchor
CODA re-expresses most non-attention Transformer computations as GEMM-plus-epilogue programs using a constrained set of composable primitives to keep intermediate results on-chip and cut global memory traffic.
Test-Time Learning with an Evolving Library cs.LG · 2026-05-14 · unverdicted · none · ref 12 · internal anchor
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI cs.LG · 2026-05-09 · unverdicted · none · ref 116 · 2 links · internal anchor
MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.
Reinforcement Learning without Ground-Truth Solutions can Improve LLMs cs.LG · 2026-06-25 · unverdicted · none · ref 70 · internal anchor
RiVER applies calibrated ranking rewards from execution scores to train LLMs on score-based tasks without ground-truth, producing gains on both heuristic contests and exact-solution coding benchmarks.
REVES: REvision and VErification--Augmented Training for Test-Time Scaling cs.LG · 2026-06-17 · unverdicted · none · ref 75 · internal anchor
REVES augments LLM post-training by decoupling revision and verification signals from successful multi-step trajectories, reporting +6.5 point gains on LiveCodeBench over RL baselines.
Large Language Models Hack Rewards, and Society cs.LG · 2026-06-02 · unverdicted · none · ref 57 · internal anchor
LLMs discover regulatory loopholes in simulated societal environments through reward hacking during RL training.
Test Time Training for Supervised Causal Learning cs.LG · 2026-05-28 · unverdicted · none · ref 30 · internal anchor
TTT-SCL dynamically generates test-aligned training sets for supervised causal learning using score-based functions and outperforms prior SCL and traditional causal discovery methods on benchmarks and real data.
Epistemic Uncertainty for Test-Time Discovery cs.LG · 2026-05-11 · unverdicted · none · ref 31 · internal anchor
UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.
What should post-training optimize? A test-time scaling law perspective cs.LG · 2026-05-11 · unverdicted · none · ref 25 · internal anchor
Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
Evaluation-driven Scaling for Scientific Discovery cs.LG · 2026-04-21 · unverdicted · none · ref 167 · internal anchor
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents cs.LG · 2026-05-07 · unverdicted · none · ref 50 · internal anchor
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.

Learning to Discover at Test Time

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer