Evaluation-driven Scaling for Scientific Discovery
Pith reviewed 2026-05-10 03:36 UTC · model grok-4.3
The pith
Simple test-time scaling of evaluation loops lets open models discover better scientific solutions than frontier systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SimpleTES scales evaluation-driven discovery by strategically combining parallel exploration, feedback-driven refinement, and local selection; when applied to gpt-oss models it produces state-of-the-art solutions on 21 scientific problems spanning six domains, outperforming both frontier-model baselines and sophisticated optimization pipelines while also generating successful trajectories that improve subsequent model performance through post-training.
What carries the argument
Simple Test-time Evaluation-driven Scaling (SimpleTES), a framework that amplifies the impact of verifiers, simulators, or scoring functions by running many candidates in parallel, refining them based on evaluation feedback, and selecting locally optimal trajectories.
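The abstract names the three ingredients (parallel exploration, feedback-driven refinement, local selection) but gives no pseudocode here. As a minimal sketch of how such a loop could be wired together, with `propose`, `refine`, and `score` as hypothetical stand-ins for LLM sampling and the task-specific verifier, not the paper's actual implementation:

```python
def evaluation_driven_search(propose, refine, score, n_parallel=8, n_rounds=5):
    """Toy evaluation-driven discovery loop in the SimpleTES spirit:
    parallel exploration, feedback-driven refinement, and local selection.
    `propose`, `refine`, and `score` are hypothetical stand-ins for
    LLM sampling and the task-specific verifier or scoring function."""
    # Parallel exploration: sample independent candidate solutions.
    trajectories = [[propose()] for _ in range(n_parallel)]
    for _ in range(n_rounds):
        for traj in trajectories:
            # Local selection: pick the best candidate within this trajectory.
            best = max(traj, key=score)
            # Feedback-driven refinement: revise it using its evaluation.
            feedback = score(best)
            traj.append(refine(best, feedback))
    # Final selection across all trajectories and rounds.
    return max((c for traj in trajectories for c in traj), key=score)
```

The point of the sketch is the shape of the loop: the verifier is called inside every refinement step, which is why scorer reliability (see the load-bearing premise below) carries so much weight.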
If this is right
- Evaluation scaling becomes a practical axis for advancing LLM-driven discovery independent of further pre-training gains.
- Trajectory histories collected during successful discoveries can be reused to post-train models that solve both seen and unseen problems more efficiently.
- Open-weight models equipped with this loop can surpass closed frontier models on concrete algorithmic and combinatorial tasks.
- Specific domains see measurable gains, such as a more-than-2x speedup for LASSO, a 24.5% reduction in quantum gate overhead, and new record constructions for the Erdős minimum overlap problem.
Where Pith is reading between the lines
- If reliable verifiers can be engineered for additional domains, the same scaling approach may accelerate discovery in fields currently limited by weak feedback signals.
- The emphasis on test-time loops suggests that future model development could prioritize training for effective use of external evaluation rather than solely increasing raw capability.
- Post-training on discovery trajectories may create a feedback cycle where each round of scaling produces data that makes the next round more effective.
Load-bearing premise
Reliable, unbiased verifiers or task-specific scoring functions exist for the problems and can steer refinement without systematic errors or hidden constraints on the search space.
What would settle it
Running SimpleTES on a fresh scientific problem whose verifier is known to be noisy or biased: if the loop produces no improvement over direct prompting, or yields solutions that fail independent verification while a non-scaled baseline succeeds, the premise fails.
Original abstract
Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery, a problem this paper seeks to address. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery loops along the right dimensions. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. Particularly, we sped up the widely used LASSO algorithm by over 2x, designed quantum circuit routing policies that reduce gate overhead by 24.5%, and discovered new Erdos minimum overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective evaluation-driven loop scaling as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Simple Test-time Evaluation-driven Scaling (SimpleTES), a framework that combines parallel exploration, feedback-driven refinement, and local selection to scale evaluation-driven discovery loops using LLMs. It reports that SimpleTES, applied with gpt-oss models, achieves state-of-the-art results on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines. Specific claims include a >2x speedup on the LASSO algorithm, a 24.5% reduction in gate overhead for quantum circuit routing, and new Erdős minimum-overlap constructions that surpass prior best-known results. The work also shows that successful discovery trajectories can supervise post-training, improving efficiency on seen problems and enabling generalization to unseen ones.
Significance. If the empirical results hold under properly validated evaluators, the work would be significant for LLM-driven scientific discovery by framing evaluation scaling as a central, actionable axis and offering a simple, practical framework. The trajectory-based post-training component is a notable strength, as it converts discovery histories into reusable supervision signals that demonstrably improve both in-domain efficiency and out-of-domain generalization. These elements could influence how future systems integrate verifiers and simulators into iterative loops.
major comments (2)
- §4 (Experimental Setup) and §5 (Results): The SOTA claims, including the new Erdős minimum-overlap constructions, >2x LASSO speedup, and 24.5% quantum-routing improvement, rest on problem-specific verifiers and scoring functions. No independent validation, cross-checks against external oracles or full literature baselines, or ablations on scorer fidelity/approximation error are reported. This is load-bearing for the central empirical claims, as any systematic bias or incompleteness in the scorers would turn reported improvements into artifacts.
- §4 (Experimental Setup): No details are provided on experimental controls, statistical tests for significance, variance across LLM sampling runs, or exact baseline implementations (e.g., how frontier-model and optimization-pipeline comparisons were configured). Without these, it is impossible to determine whether the consistent outperformance is robust or attributable to SimpleTES rather than implementation choices or stochastic effects.
minor comments (2)
- Abstract and §3: The term 'gpt-oss models' is used without an explicit definition or a list of the specific models and versions employed; this should be clarified for reproducibility.
- §5: Trajectory-level histories are described as naturally supervising feedback-driven learning, but the post-training protocol (data filtering, loss formulation, training hyperparameters) receives only high-level treatment; a dedicated subsection or appendix would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback. We address each major comment point by point below, indicating where revisions have been made to strengthen the manuscript.
Point-by-point responses
Referee: §4 (Experimental Setup) and §5 (Results): The SOTA claims, including the new Erdős minimum-overlap constructions, >2x LASSO speedup, and 24.5% quantum-routing improvement, rest on problem-specific verifiers and scoring functions. No independent validation, cross-checks against external oracles or full literature baselines, or ablations on scorer fidelity/approximation error are reported. This is load-bearing for the central empirical claims, as any systematic bias or incompleteness in the scorers would turn reported improvements into artifacts.
Authors: We agree that the reliability of the reported improvements depends on the verifiers. For each of the 21 problems, the scoring functions are drawn from established, objective metrics in the respective literatures (e.g., wall-clock runtime via standard solvers for LASSO, exact gate-count simulation for quantum routing, and direct mathematical verification for Erdős constructions). In the revised manuscript we have added a new subsection to §4 that explicitly documents every verifier, its implementation, known limitations, and any approximation error bounds. For the new Erdős constructions we now include the explicit solutions together with a verification script in the supplementary material. We have also added a targeted ablation on scorer fidelity for the subset of problems that use approximate evaluators, confirming that the reported gains remain stable under reasonable perturbations. While we cannot perform external oracle validation within the scope of this work, the added documentation and code enable independent reproduction and checking. We maintain that the improvements are not artifacts because all methods (SimpleTES, frontier models, and optimization baselines) were evaluated under identical verifiers. revision: partial
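The rebuttal's scorer-fidelity ablation (checking that gains remain stable under reasonable perturbations of approximate evaluators) can be illustrated with a toy ranking-stability check. The function below and its Gaussian-noise model are illustrative assumptions, not the paper's actual protocol:

```python
import random

def ranking_stable_under_noise(candidates, score, noise_sd=0.01, trials=200, seed=0):
    """Toy scorer-fidelity check: does the top-ranked candidate under the
    clean scorer stay on top when Gaussian noise of scale `noise_sd` is
    added to every score?  Returns the fraction of noisy trials in which
    the winner is unchanged (1.0 = ranking fully stable)."""
    rng = random.Random(seed)
    clean_scores = {c: score(c) for c in candidates}
    best_clean = max(candidates, key=clean_scores.get)
    agree = 0
    for _ in range(trials):
        # Perturb every score independently, then re-rank.
        noisy = {c: s + rng.gauss(0.0, noise_sd) for c, s in clean_scores.items()}
        if max(candidates, key=noisy.get) == best_clean:
            agree += 1
    return agree / trials
```

If the agreement fraction stays near 1.0 for noise levels comparable to the scorer's known approximation error, the selected solution is unlikely to be an artifact of scorer noise; a sharp drop would flag exactly the fragility the referee worries about.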
Referee: §4 (Experimental Setup): No details are provided on experimental controls, statistical tests for significance, variance across LLM sampling runs, or exact baseline implementations (e.g., how frontier-model and optimization-pipeline comparisons were configured). Without these, it is impossible to determine whether the consistent outperformance is robust or attributable to SimpleTES rather than implementation choices or stochastic effects.
Authors: We acknowledge that the original §4 lacked sufficient experimental detail. In the revised version we have expanded §4 with the following additions: (1) precise configurations and prompt templates for all frontier-model and optimization-pipeline baselines; (2) the number of independent sampling runs (five per problem, different random seeds) together with reported standard deviations; (3) statistical significance testing via paired t-tests with p-values now shown in the result tables; and (4) fixed sampling hyperparameters (temperature, top-p, etc.) across all compared methods. These controls demonstrate that the observed gains are robust and attributable to the SimpleTES framework rather than implementation or stochastic variation. revision: yes
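The protocol the authors describe (five seeded runs per method, paired t-tests) can be sketched with a hand-rolled paired t statistic. The per-seed scores below are hypothetical, and rather than computing an exact p-value the statistic is compared against the standard two-sided 5% critical value for four degrees of freedom:

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over matched per-seed scores of two methods:
    t = mean(d) / (sd(d) / sqrt(n)), with d_i = a_i - b_i."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n)

# Hypothetical per-seed scores for five runs of each method (higher is better).
simpletes = [0.91, 0.88, 0.93, 0.90, 0.92]
baseline  = [0.84, 0.86, 0.85, 0.83, 0.87]

t = paired_t_statistic(simpletes, baseline)
T_CRIT_DF4 = 2.776  # two-sided 5% critical value of the t distribution, df = n - 1 = 4
significant = abs(t) > T_CRIT_DF4
```

Pairing by seed matters here: because both methods share each seed's sampling randomness, the paired test removes run-to-run variance that an unpaired comparison would leave in.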
Circularity Check
No circularity: empirical results on external benchmarks
Full rationale
The paper presents SimpleTES as an empirical framework combining parallel exploration, feedback-driven refinement, and local selection, then reports performance gains on 21 problems against external baselines (LASSO runtime, quantum gate counts, Erdős overlap records). No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation. Post-training on trajectories is described as an additional observed outcome rather than a definitional step. All central claims rest on comparisons to independent SOTA and baselines, making the work self-contained against external metrics.
Forward citations
Cited by 1 Pith paper
- Harnessing Agentic Evolution: AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by a 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.