Evaluation-driven Scaling for Scientific Discovery
Pith reviewed 2026-05-10 03:36 UTC · model grok-4.3
The pith
Simple test-time scaling of evaluation loops lets open models discover better scientific solutions than frontier systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SimpleTES scales evaluation-driven discovery by strategically combining parallel exploration, feedback-driven refinement, and local selection; when applied to gpt-oss models it produces state-of-the-art solutions on 21 scientific problems spanning six domains, outperforming both frontier-model baselines and sophisticated optimization pipelines while also generating successful trajectories that improve subsequent model performance through post-training.
What carries the argument
Simple Test-time Evaluation-driven Scaling (SimpleTES), a framework that amplifies the impact of verifiers, simulators, or scoring functions by running many candidates in parallel, refining them based on evaluation feedback, and selecting locally optimal trajectories.
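The abstract names the three ingredients (parallel exploration, feedback-driven refinement, local selection) but gives no pseudocode here. As a minimal sketch of how such a loop could be wired together, with `propose`, `refine`, and `score` as hypothetical stand-ins for LLM sampling and the task-specific verifier, not the paper's actual implementation:

```python
def evaluation_driven_search(propose, refine, score, n_parallel=8, n_rounds=5):
    """Toy evaluation-driven discovery loop in the SimpleTES spirit:
    parallel exploration, feedback-driven refinement, and local selection.
    `propose`, `refine`, and `score` are hypothetical stand-ins for
    LLM sampling and the task-specific verifier or scoring function."""
    # Parallel exploration: sample independent candidate solutions.
    trajectories = [[propose()] for _ in range(n_parallel)]
    for _ in range(n_rounds):
        for traj in trajectories:
            # Local selection: pick the best candidate within this trajectory.
            best = max(traj, key=score)
            # Feedback-driven refinement: revise it using its evaluation.
            feedback = score(best)
            traj.append(refine(best, feedback))
    # Final selection across all trajectories and rounds.
    return max((c for traj in trajectories for c in traj), key=score)
```

The point of the sketch is the shape of the loop: the verifier is called inside every refinement step, which is why scorer reliability (see the load-bearing premise below) carries so much weight.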
If this is right
- Evaluation scaling becomes a practical axis for advancing LLM-driven discovery independent of further pre-training gains.
- Trajectory histories collected during successful discoveries can be reused to post-train models that solve both seen and unseen problems more efficiently.
- Open-weight models equipped with this loop can surpass closed frontier models on concrete algorithmic and combinatorial tasks.
- Specific domains see measurable gains, such as a more-than-2x speedup for LASSO, a 24.5% reduction in quantum gate overhead, and new record constructions for the Erdős minimum overlap problem.
Where Pith is reading between the lines
- If reliable verifiers can be engineered for additional domains, the same scaling approach may accelerate discovery in fields currently limited by weak feedback signals.
- The emphasis on test-time loops suggests that future model development could prioritize training for effective use of external evaluation rather than solely increasing raw capability.
- Post-training on discovery trajectories may create a feedback cycle where each round of scaling produces data that makes the next round more effective.
Load-bearing premise
Reliable, unbiased verifiers or task-specific scoring functions exist for the problems and can steer refinement without systematic errors or hidden constraints on the search space.
What would settle it
Running SimpleTES on a fresh scientific problem whose verifier is known to be noisy or biased: if the loop produces no improvement over direct prompting, or yields solutions that fail independent verification while a non-scaled baseline succeeds, the premise fails.
Original abstract
Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery, a problem this paper seeks to address. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery loops along the right dimensions. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. Particularly, we sped up the widely used LASSO algorithm by over 2x, designed quantum circuit routing policies that reduce gate overhead by 24.5%, and discovered new Erdos minimum overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective evaluation-driven loop scaling as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Simple Test-time Evaluation-driven Scaling (SimpleTES), a framework that combines parallel exploration, feedback-driven refinement, and local selection to scale evaluation-driven discovery loops using LLMs. It reports that SimpleTES, applied with gpt-oss models, achieves state-of-the-art results on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines. Specific claims include a >2x speedup on the LASSO algorithm, a 24.5% reduction in gate overhead for quantum circuit routing, and new Erdős minimum-overlap constructions that surpass prior best-known results. The work also shows that successful discovery trajectories can supervise post-training, improving efficiency on seen problems and enabling generalization to unseen ones.
Significance. If the empirical results hold under properly validated evaluators, the work would be significant for LLM-driven scientific discovery by framing evaluation scaling as a central, actionable axis and offering a simple, practical framework. The trajectory-based post-training component is a notable strength, as it converts discovery histories into reusable supervision signals that demonstrably improve both in-domain efficiency and out-of-domain generalization. These elements could influence how future systems integrate verifiers and simulators into iterative loops.
major comments (2)
- §4 (Experimental Setup) and §5 (Results): The SOTA claims, including the new Erdős minimum-overlap constructions, >2x LASSO speedup, and 24.5% quantum-routing improvement, rest on problem-specific verifiers and scoring functions. No independent validation, cross-checks against external oracles or full literature baselines, or ablations on scorer fidelity/approximation error are reported. This is load-bearing for the central empirical claims, as any systematic bias or incompleteness in the scorers would turn reported improvements into artifacts.
- §4 (Experimental Setup): No details are provided on experimental controls, statistical tests for significance, variance across LLM sampling runs, or exact baseline implementations (e.g., how frontier-model and optimization-pipeline comparisons were configured). Without these, it is impossible to determine whether the consistent outperformance is robust or attributable to SimpleTES rather than implementation choices or stochastic effects.
minor comments (2)
- Abstract and §3: The term 'gpt-oss models' is used without an explicit definition or a list of the specific models and versions employed; this should be clarified for reproducibility.
- §5: Trajectory-level histories are described as naturally supervising feedback-driven learning, but the post-training protocol (data filtering, loss formulation, training hyperparameters) receives only high-level treatment; a dedicated subsection or appendix would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback. We address each major comment point by point below, indicating where revisions have been made to strengthen the manuscript.
Point-by-point responses
Referee: §4 (Experimental Setup) and §5 (Results): The SOTA claims, including the new Erdős minimum-overlap constructions, >2x LASSO speedup, and 24.5% quantum-routing improvement, rest on problem-specific verifiers and scoring functions. No independent validation, cross-checks against external oracles or full literature baselines, or ablations on scorer fidelity/approximation error are reported. This is load-bearing for the central empirical claims, as any systematic bias or incompleteness in the scorers would turn reported improvements into artifacts.
Authors: We agree that the reliability of the reported improvements depends on the verifiers. For each of the 21 problems, the scoring functions are drawn from established, objective metrics in the respective literatures (e.g., wall-clock runtime via standard solvers for LASSO, exact gate-count simulation for quantum routing, and direct mathematical verification for Erdős constructions). In the revised manuscript we have added a new subsection to §4 that explicitly documents every verifier, its implementation, known limitations, and any approximation error bounds. For the new Erdős constructions we now include the explicit solutions together with a verification script in the supplementary material. We have also added a targeted ablation on scorer fidelity for the subset of problems that use approximate evaluators, confirming that the reported gains remain stable under reasonable perturbations. While we cannot perform external oracle validation within the scope of this work, the added documentation and code enable independent reproduction and checking. We maintain that the improvements are not artifacts because all methods (SimpleTES, frontier models, and optimization baselines) were evaluated under identical verifiers. revision: partial
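The rebuttal's scorer-fidelity ablation (checking that gains remain stable under reasonable perturbations of approximate evaluators) can be illustrated with a toy ranking-stability check. The function below and its Gaussian-noise model are illustrative assumptions, not the paper's actual protocol:

```python
import random

def ranking_stable_under_noise(candidates, score, noise_sd=0.01, trials=200, seed=0):
    """Toy scorer-fidelity check: does the top-ranked candidate under the
    clean scorer stay on top when Gaussian noise of scale `noise_sd` is
    added to every score?  Returns the fraction of noisy trials in which
    the winner is unchanged (1.0 = ranking fully stable)."""
    rng = random.Random(seed)
    clean_scores = {c: score(c) for c in candidates}
    best_clean = max(candidates, key=clean_scores.get)
    agree = 0
    for _ in range(trials):
        # Perturb every score independently, then re-rank.
        noisy = {c: s + rng.gauss(0.0, noise_sd) for c, s in clean_scores.items()}
        if max(candidates, key=noisy.get) == best_clean:
            agree += 1
    return agree / trials
```

If the agreement fraction stays near 1.0 for noise levels comparable to the scorer's known approximation error, the selected solution is unlikely to be an artifact of scorer noise; a sharp drop would flag exactly the fragility the referee worries about.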
Referee: §4 (Experimental Setup): No details are provided on experimental controls, statistical tests for significance, variance across LLM sampling runs, or exact baseline implementations (e.g., how frontier-model and optimization-pipeline comparisons were configured). Without these, it is impossible to determine whether the consistent outperformance is robust or attributable to SimpleTES rather than implementation choices or stochastic effects.
Authors: We acknowledge that the original §4 lacked sufficient experimental detail. In the revised version we have expanded §4 with the following additions: (1) precise configurations and prompt templates for all frontier-model and optimization-pipeline baselines; (2) the number of independent sampling runs (five per problem, different random seeds) together with reported standard deviations; (3) statistical significance testing via paired t-tests with p-values now shown in the result tables; and (4) fixed sampling hyperparameters (temperature, top-p, etc.) across all compared methods. These controls demonstrate that the observed gains are robust and attributable to the SimpleTES framework rather than implementation or stochastic variation. revision: yes
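The protocol the authors describe (five seeded runs per method, paired t-tests) can be sketched with a hand-rolled paired t statistic. The per-seed scores below are hypothetical, and rather than computing an exact p-value the statistic is compared against the standard two-sided 5% critical value for four degrees of freedom:

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over matched per-seed scores of two methods:
    t = mean(d) / (sd(d) / sqrt(n)), with d_i = a_i - b_i."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n)

# Hypothetical per-seed scores for five runs of each method (higher is better).
simpletes = [0.91, 0.88, 0.93, 0.90, 0.92]
baseline  = [0.84, 0.86, 0.85, 0.83, 0.87]

t = paired_t_statistic(simpletes, baseline)
T_CRIT_DF4 = 2.776  # two-sided 5% critical value of the t distribution, df = n - 1 = 4
significant = abs(t) > T_CRIT_DF4
```

Pairing by seed matters here: because both methods share each seed's sampling randomness, the paired test removes run-to-run variance that an unpaired comparison would leave in.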
Circularity Check
No circularity: empirical results on external benchmarks
Full rationale
The paper presents SimpleTES as an empirical framework combining parallel exploration, feedback-driven refinement, and local selection, then reports performance gains on 21 problems against external baselines (LASSO runtime, quantum gate counts, Erdős overlap records). No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation. Post-training on trajectories is described as an additional observed outcome rather than a definitional step. All central claims rest on comparisons to independent SOTA and baselines, making the work self-contained against external metrics.
Forward citations
Cited by 1 Pith paper
- Harnessing Agentic Evolution: AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by a 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.