Recognition: 3 theorem links
Learning to Discover at Test Time
Pith reviewed 2026-05-16 05:11 UTC · model grok-4.3
The pith
Reinforcement learning at test time on one problem lets an open LLM produce new state-of-the-art solutions for math, coding, and biology tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TTT-Discover performs reinforcement learning at test time, so the LLM continues to train on experience specific to the test problem. The learning objective and search subroutine are designed to prioritize the most promising solutions and thereby produce one great solution for that exact problem rather than many good ones on average. Applied across mathematics, GPU kernel engineering, algorithm design, and biology, the method sets new state-of-the-art results on Erdős' minimum overlap problem, an autocorrelation inequality, a GPUMode kernel competition (with kernels up to 2× faster than prior art), past AtCoder algorithm competitions, and a denoising problem in single-cell analysis, all achieved with an open model, OpenAI gpt-oss-120b.
What carries the argument
Test-Time Training to Discover (TTT-Discover), which applies reinforcement learning at test time on experience gathered from one specific problem to refine the model toward a single superior solution.
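The loop this summary describes (sample candidates from the current policy, score them with a continuous reward, update the policy toward the most promising samples, keep the single best solution found) can be sketched with a toy numeric policy standing in for the LLM. Every name, reward, and update rule below is an illustrative assumption, not the paper's actual objective or search subroutine.

```python
import random

def reward(x):
    """Continuous reward for one candidate solution (toy: maximize -(x - 3)^2)."""
    return -(x - 3.0) ** 2

def propose(mean, spread, n):
    """Stand-in for sampling n candidate solutions from the current policy."""
    return [random.gauss(mean, spread) for _ in range(n)]

def ttt_loop(steps=200, batch=16, lr=0.2, seed=0):
    """Test-time loop: adapt the policy to this one problem, return the best solution."""
    random.seed(seed)
    mean = 0.0                      # "policy" parameter, updated at test time
    best_x, best_r = None, float("-inf")
    for _ in range(steps):
        candidates = propose(mean, 2.0, batch)
        ranked = sorted(candidates, key=reward, reverse=True)
        top = ranked[: batch // 4]  # prioritize the most promising samples
        mean += lr * (sum(top) / len(top) - mean)  # move the policy toward them
        if reward(ranked[0]) > best_r:
            best_x, best_r = ranked[0], reward(ranked[0])
    return best_x, best_r
```

The design choice mirrored here is that only the single best solution is ever retained: average performance over samples is irrelevant as long as one sample is excellent.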
If this is right
- New state-of-the-art solutions become reachable for continuous-reward problems in mathematics, engineering, algorithms, and biology using open models.
- Test-time reinforcement learning can outperform prompting a frozen LLM for discovery-oriented search.
- Expert-reviewed improvements are achievable in GPU kernel speed and algorithm contest performance.
- Results remain reproducible with publicly available code at a cost of a few hundred dollars per problem.
- The same training loop can be applied directly to new problems without retraining on a broad distribution.
Where Pith is reading between the lines
- If test-time training scales reliably, teams could solve narrow but high-value scientific problems by investing modest compute on one instance rather than retraining large models.
- The approach might transfer to non-language models if the reinforcement signal can be defined for other continuous optimization domains.
- Repeated application across related problems could accumulate specialized knowledge inside a single model instance without full retraining.
- Human experts could supply the reward function or final validation step to steer the search toward practically useful rather than merely high-scoring solutions.
Load-bearing premise
That reinforcement learning performed at test time on experience specific to one problem will reliably produce a single superior solution rather than overfitting or failing to improve over frozen-model search.
What would settle it
Reproducing the method on one of the reported problems, such as the GPUMode kernel task, and failing to match or exceed the claimed performance gains with the same open model.
read the original abstract
How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Test-Time Training to Discover (TTT-Discover), which performs reinforcement learning at test time on experience specific to a single test problem. The goal is to produce one superior solution rather than average performance across problems. The authors apply this to continuous-reward tasks and report new state-of-the-art results on Erdős' minimum overlap problem and an autocorrelation inequality, a GPUMode kernel competition (up to 2× faster), past AtCoder algorithm competitions, and a single-cell denoising task, all using the open gpt-oss-120b model with publicly released code.
Significance. If the central claims are substantiated, the work would be significant for demonstrating that problem-specific test-time RL can yield discovery-level improvements across mathematics, systems engineering, algorithms, and biology while using only an open model and modest compute budgets. The emphasis on reproducibility via public code and the contrast with prior closed-model results are concrete strengths that would lower barriers to AI-assisted discovery if the performance gains are shown to arise from the adaptation mechanism rather than extended search alone.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): No ablation is reported that isolates the contribution of online weight updates from simply running longer search with a frozen model. For the Erdős overlap and GPUMode tasks, any reported improvement could be explained by increased inference-time compute rather than the continual-learning component; without this separation the attribution of SOTA results to TTT-Discover is not secured.
- [§3] §3 (Method): The claim that the learning objective and search subroutine have been redesigned to prioritize promising solutions is load-bearing for the single-solution focus, yet the manuscript supplies no equations, pseudocode, or quantitative comparison showing how these changes differ from standard RL or prevent overfitting on a single instance.
minor comments (2)
- [Abstract] Abstract: The statement that results were 'reviewed by experts or the organizers' would be strengthened by naming the reviewers or providing links to the review process for each domain.
- [Throughout] Throughout: The paper would benefit from explicit reporting of wall-clock time, number of RL steps, and variance across random seeds for each claimed improvement to allow direct comparison with prior frozen-LLM baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the positive assessment of the work's potential significance. We address each major comment below and commit to revisions that will strengthen the manuscript's claims regarding the contributions of test-time training.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): No ablation is reported that isolates the contribution of online weight updates from simply running longer search with a frozen model. For the Erdős overlap and GPUMode tasks, any reported improvement could be explained by increased inference-time compute rather than the continual-learning component; without this separation the attribution of SOTA results to TTT-Discover is not secured.
Authors: We agree that isolating the effect of online weight updates from extended inference-time search with a frozen model is crucial for attributing the performance gains to the TTT-Discover mechanism. In the revised version, we will add ablations for the Erdős minimum overlap and GPUMode tasks. These will compare TTT-Discover against a frozen-model baseline that uses the same total compute budget but without weight updates, employing standard prompting or search methods. This will demonstrate whether the continual learning component provides benefits beyond increased search effort. revision: yes
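The compute-matched ablation the authors commit to can be illustrated on a toy problem: both conditions receive the same number of candidate evaluations, and only one updates its parameters between batches. The reward, update rule, and budget here are hypothetical stand-ins, not the paper's experimental setup.

```python
import random

def reward(x):
    """Toy continuous reward: maximize -(x - 6)^2."""
    return -(x - 6.0) ** 2

def run(update, budget=2000, batch=20, lr=0.3, seed=1):
    """Search under a fixed evaluation budget; `update` toggles test-time adaptation."""
    random.seed(seed)
    mean, best = 0.0, float("-inf")
    for _ in range(budget // batch):
        xs = [random.gauss(mean, 1.0) for _ in range(batch)]
        best = max(best, max(reward(x) for x in xs))
        if update:  # the frozen baseline skips this step
            top = sorted(xs, key=reward, reverse=True)[: batch // 4]
            mean += lr * (sum(top) / len(top) - mean)
    return best

frozen = run(update=False)   # longer search with a frozen sampler
adapted = run(update=True)   # same budget, with parameter updates
```

Only a comparison of this shape, equal total compute with and without updates, attributes a gain to the adaptation mechanism rather than to extra search.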
-
Referee: [§3] §3 (Method): The claim that the learning objective and search subroutine have been redesigned to prioritize promising solutions is load-bearing for the single-solution focus, yet the manuscript supplies no equations, pseudocode, or quantitative comparison showing how these changes differ from standard RL or prevent overfitting on a single instance.
Authors: We recognize the need for greater formality in describing the modifications to the learning objective and search subroutine. In the revision, we will include explicit equations for the redesigned objective function that emphasizes promising solutions, along with pseudocode for the adapted search procedure. Additionally, we will provide a quantitative comparison to standard RL methods, including metrics on solution prioritization and overfitting prevention, such as reward concentration on top solutions and generalization within the single-instance setting. revision: yes
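One generic way an objective can prioritize the most promising solutions, sketched here only as a guess at the family of designs the rebuttal refers to, is to weight each sampled solution by a low-temperature softmax over rewards, so nearly all update mass lands on the top samples.

```python
import math

def prioritized_weights(rewards, temperature=0.1):
    """Low-temperature softmax over rewards: concentrates weight on top samples."""
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / temperature) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]

def uniform_weights(rewards):
    """Standard average-reward treatment: every sample counts equally."""
    return [1.0 / len(rewards)] * len(rewards)

rewards = [0.10, 0.20, 0.90, 0.95]
w = prioritized_weights(rewards)  # almost all mass on the 0.90 and 0.95 samples
```

A reward-concentration metric of the kind proposed above could then simply be the total weight assigned to the top-k samples.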
Circularity Check
No circularity: method applies standard RL at test time with no self-referential derivations
full rationale
The paper presents TTT-Discover as an application of reinforcement learning performed at test time on problem-specific experience, with objectives and search subroutines redesigned to prioritize promising solutions for a single instance rather than average performance. No equations, derivations, or parameter-fitting steps are described that would reduce any claimed result to its own inputs by construction. The approach is framed as a direct extension of existing RL techniques to the test-time setting, with empirical SOTA results reported across tasks; these outcomes are presented as experimental findings rather than outputs of a closed mathematical chain. No self-citation load-bearing premises, uniqueness theorems imported from prior author work, or ansatz smuggling appear in the description. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.LawOfExistence.existence_economically_inevitable (tag: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling. AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
- Test-Time Learning with an Evolving Library. EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
- Harnessing Agentic Evolution. AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
- LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling. AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.
- Agentic-imodels: Evolving agentic interpretability tools via autoresearch. Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
- New Bounds for Zarankiewicz Numbers via Reinforced LLM Evolutionary Search. LLM-reinforced evolutionary search produces exact values Z(11,21,3,3)=116, Z(11,22,3,3)=121, Z(12,22,3,3)=132 and lower bounds for 41 additional Zarankiewicz numbers.
- Meta-Harness: End-to-End Optimization of Model Harnesses. Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
- MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning. MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...
- Epistemic Uncertainty for Test-Time Discovery. UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.
- What should post-training optimize? A test-time scaling law perspective. Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
- MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI. MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
- Evaluation-driven Scaling for Scientific Discovery. SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
- Efficient Retrieval Scaling with Hierarchical Indexing for Large Scale Recommendation. A jointly learned hierarchical index with cross-attention and residual quantization scales exact retrieval in foundational recommendation models, deployed at Meta with additional performance from test-time training on...
- Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization. Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
- TurboEvolve: Towards Fast and Robust LLM-Driven Program Evolution. TurboEvolve improves LLM program evolution by running parallel islands with LLM-generated diverse candidates that carry self-assigned weights, an adaptive scheduler, and clustered seed injection to reach stronger solu...
- GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning. GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.
- Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization. Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to...
- PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents. PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...
- Grokability in five inequalities. Five improved inequalities were found with AI help: better Gaussian perimeter bounds for convex sets, sharper L2-L1 moments on the Hamming cube, a strengthened autoconvolution inequality, improved g-Sidon set bounds, ...
Reference graph
Works this paper leans on
-
[1]
gpt-oss-120b & gpt-oss-20b Model Card
Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. The surprising effectiveness of test-time training for few-shot learning.arXiv preprint arXiv:2411.07279, 2024
- [3]
-
[4]
Test-time Offline Reinforcement Learning on Goal-related Experience
Marco Bagatella, Mert Albaba, Jonas Hübotter, Georg Martius, and Andreas Krause. Test-time offline reinforcement learning on goal-related experience.arXiv preprint arXiv:2507.18809, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Richard C Barnard and Stefan Steinerberger. Three convolution inequalities on the real line with connections to additive combinatorics.Journal of Number Theory, 207:42–55, 2020
work page 2020
-
[6]
Molecular cross-validation for single-cell rna-seq.BioRxiv, page 786269, 2019
Joshua Batson, Loic Royer, and James Webber. Molecular cross-validation for single-cell rna-seq.BioRxiv, page 786269, 2019
work page 2019
-
[7]
Neural Combinatorial Optimization with Reinforcement Learning
Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning.arXiv preprint arXiv:1611.09940, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[8]
Local learning algorithms.Neural computation, 4(6):888–900, 1992
Léon Bottou and Vladimir Vapnik. Local learning algorithms.Neural computation, 4(6):888–900, 1992
work page 1992
-
[9]
An improved example for an autoconvolution inequality.arXiv preprint arXiv:2506.16750, 2025
Christopher Boyer and Zane Kun Li. An improved example for an autoconvolution inequality.arXiv preprint arXiv:2506.16750, 2025
-
[10]
William S Cleveland. Robust locally weighted regression and smoothing scatterplots.Journal of the American statistical association, 74(368):829–836, 1979
work page 1979
-
[11]
Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks.IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021
work page 2021
-
[12]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[13]
Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A. Efros. Test-time training with masked autoencoders. Advances in Neural Information Processing Systems, 2022
work page 2022
-
[14]
Mathematical exploration and discovery at scale.arXiv preprint arXiv:2511.02864, 2025
Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale.arXiv preprint arXiv:2511.02864, 2025
-
[15]
Dynamic few-shot visual learning without forgetting
Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018. 20
work page 2018
-
[16]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025
work page 2025
-
[17]
Raia Hadsell, Dushyant Rao, Andrei A Rusu, and Razvan Pascanu. Embracing change: Continual learning in deep neural networks.Trends in cognitive sciences, 24(12):1028–1040, 2020
work page 2020
-
[18]
Moritz Hardt and Yu Sun. Test-time training on nearest neighbors for large language models.arXiv preprint arXiv:2305.18466, 2023
-
[19]
Neuroscience- inspired artificial intelligence.Neuron, 95(2):245–258, 2017
Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. Neuroscience- inspired artificial intelligence.Neuron, 95(2):245–258, 2017
work page 2017
-
[20]
The minimum overlap problem revisited
Jan Kristian Haugland. The minimum overlap problem revisited.arXiv preprint arXiv:1609.08000, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[21]
Masked Autoencoders Are Scalable Vision Learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoen- coders are scalable vision learners.CoRR, abs/2111.06377, 2021
work page internal anchor Pith review arXiv 2021
-
[22]
The many faces of robustness: A critical analysis of out-of-distribution generalization.ICCV, 2021
Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization.ICCV, 2021
work page 2021
-
[23]
Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
work page 2022
- [24]
-
[25]
Efficiently learning at test-time: Active fine-tuning of llms.arXiv preprint arXiv:2410.08020, 2024
Jonas Hübotter, Sascha Bongni, Ido Hakimi, and Andreas Krause. Efficiently learning at test-time: Active fine-tuning of llms.arXiv preprint arXiv:2410.08020, 2024
-
[26]
Jonas Hübotter, Leander Diaz-Bone, Ido Hakimi, Andreas Krause, and Moritz Hardt. Learning on the job: Test-time curricula for targeted reinforcement learning.arXiv preprint arXiv:2510.04786, 2025
-
[27]
Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering.arXiv preprint arXiv:2506.09050, 2025
-
[28]
Online domain adaptation of a pre-trained cascade of classifiers
Vidit Jain and Erik Learned-Miller. Online domain adaptation of a pre-trained cascade of classifiers. In CVPR 2011, pages 577–584. IEEE, 2011
work page 2011
-
[29]
Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, Yu Yue, Qianchuan Zhao, and Lin Yan. Risk-sensitive rl for alleviating exploration dilemmas in large language models.arXiv preprint arXiv:2509.24261, 2025
-
[30]
Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589, 2021
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589, 2021
work page 2021
-
[31]
Continual pre-training of language models.arXiv preprint arXiv:2302.03241,
Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models.arXiv preprint arXiv:2302.03241, 2023
-
[32]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[33]
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017
work page 2017
-
[34]
Wilds: A benchmark of in-the-wild distribution shifts
Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. InInternational conference on machine learning, pages 5637–5664. PMLR, 2021. 21
work page 2021
-
[35]
Dynamic evaluation of neural sequence models
Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. InInternational Conference on Machine Learning, pages 2766–2775. PMLR, 2018
work page 2018
- [36]
-
[37]
Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. Shinkaevolve: Towards open-ended and sample- efficient program evolution.arXiv preprint arXiv:2509.19349, 2025
-
[38]
Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, et al. Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space.arXiv preprint arXiv:2505.13308, 2025
-
[39]
Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017
work page 2017
-
[40]
Zero-preserving imputation of single-cell rna-seq data.Nature communications, 13(1):192, 2022
George C Linderman, Jun Zhao, Manolis Roulis, Piotr Bielecki, Richard A Flavell, Boaz Nadler, and Yuval Kluger. Zero-preserving imputation of single-cell rna-seq data.Nature communications, 13(1):192, 2022
work page 2022
-
[41]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Fei Liu, Rui Zhang, Zhuoliang Xie, Rui Sun, Kai Li, Xi Lin, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Llm4ad: A platform for algorithm design with large language model.arXiv preprint arXiv:2412.17287, 2024
-
[43]
Gradient episodic memory for continual learning
David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017
work page 2017
-
[44]
Malte D Luecken, Scott Gigante, Daniel B Burkhardt, Robrecht Cannoodt, Daniel C Strobl, Nikolay S Markov, Luke Zappia, Giovanni Palla, Wesley Lewis, Daniel Dimitrov, et al. Defining and benchmarking open problems in single-cell analysis.Nature Biotechnology, pages 1–6, 2025
work page 2025
-
[45]
Consistent video depth estimation.ACM Transactions on Graphics (ToG), 39(4):71–1, 2020
Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation.ACM Transactions on Graphics (ToG), 39(4):71–1, 2020
work page 2020
-
[46]
Máté Matolcsi and Carlos Vinuesa. Improved bounds on the supremum of autoconvolutions.Journal of Mathematical Analysis and Applications, 372(2):439–447, 2010
work page 2010
-
[47]
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space.arXiv preprint arXiv:1301.3781, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[48]
The effect of natural distribution shift on question answering models
John Miller, Karl Krauth, Benjamin Recht, and Ludwig Schmidt. The effect of natural distribution shift on question answering models. InInternational conference on machine learning, pages 6905–6916. PMLR, 2020
work page 2020
-
[49]
Online model distillation for efficient video inference.arXiv preprint arXiv:1812.02699, 2018
Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian. Online model distillation for efficient video inference.arXiv preprint arXiv:1812.02699, 2018
-
[50]
AlphaEvolve: A coding agent for scientific and algorithmic discovery
Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [51]
-
[52]
Migrate: Mixed-policy grpo for adaptation at test-time.arXiv preprint arXiv:2508.08641, 2025
Peter Phan, Dhruv Agarwal, Kavitha Srinivas, Horst Samulowitz, Pavan Kapanipathi, and Andrew McCallum. Migrate: Mixed-policy grpo for adaptation at test-time.arXiv preprint arXiv:2508.08641, 2025
-
[53]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023
work page 2023
-
[54]
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp models with checklist.arXiv preprint arXiv:2005.04118, 2020. 22
-
[55]
Christopher D Rosin. Multi-armed bandits with episode context.Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011
work page 2011
-
[56]
Submission #59660035 — third programming contest 2024 (atcoder heuristic contest 039)
Sakana. Submission #59660035 — third programming contest 2024 (atcoder heuristic contest 039). https://atcoder.jp/contests/ahc039/submissions/59660035, November 2024. AtCoder Heuristic Contest 039 submission page
-
[57]
Sakana ai agent wins atcoder heuristic contest (first ai to place 1st)
Sakana AI. Sakana ai agent wins atcoder heuristic contest (first ai to place 1st). https://sakana.ai/ ahc058/, 2026
work page 2026
-
[58]
Meta- learning with memory-augmented neural networks
Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta- learning with memory-augmented neural networks. InInternational conference on machine learning, pages 1842–1850, 2016
work page 2016
-
[59]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[60]
Fine-tuned language models are continual learners.arXiv preprint arXiv:2205.12393,
Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. Fine-tuned language models are continual learners.arXiv preprint arXiv:2205.12393, 2022
-
[61]
Openevolve: an open-source evolutionary coding agent, 2025
Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent, 2025
work page 2025
- [62]
- [63] David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- [64] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
- [65] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
- [66] Charles J. Stone. Consistent nonparametric regression. The Annals of Statistics, pages 595–620, 1977.
- [67] Yu Sun, Xinhao Li, Karan Dalal, Chloe Hsu, Sanmi Koyejo, Carlos Guestrin, Xiaolong Wang, Tatsunori Hashimoto, and Xinlei Chen. Learning to (learn at test time). arXiv preprint arXiv:2310.13807, 2023.
- [68] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248. PMLR, 2020.
- [69] Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, and Caglar Gulcehre. Algorithm discovery with LLMs: Evolutionary search meets reinforcement learning. arXiv preprint arXiv:2504.05108, 2025.
- [70] Richard Sutton. The bitter lesson. Incomplete Ideas (blog), 13(1):38, 2019.
- [71] Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, et al. End-to-end test-time training for long context. arXiv preprint arXiv:2512.23675, 2025.
- [72] Yunhao Tang and Rémi Munos. On a few pitfalls in KL divergence gradient estimation for RL. arXiv preprint arXiv:2506.09477, 2025.
- [73] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019.
- [74] Gido M. Van de Ven and Andreas S. Tolias. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019.
- [75] David Van Dijk, Roshan Sharma, Juozas Nainys, Kristina Yim, Pooja Kathail, Ambrose J. Carr, Cassandra Burdziak, Kevin R. Moon, Christine L. Chaffer, Diwakar Pattabiraman, et al. Recovering gene interactions from single-cell data using data diffusion. Cell, 174(3):716–729, 2018.
- [76] Aarthi Venkat, Scott E. Youlten, Beatriz P. San Juan, Carley A. Purcell, Shabarni Gupta, Matthew Amodio, Daniel P. Neumann, John G. Lock, Anton E. Westacott, Cerys S. McCool, et al. AAnet resolves a continuum of spatially-localized cell states to unveil intratumoral heterogeneity. Cancer Discovery, 2025.
- [77] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024.
- [78] Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. ThetaEvolve: Test-time learning on open problems. arXiv preprint arXiv:2511.23473, 2025.
- [79] Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, et al. Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571, 2025.
- [80] Ethan Patrick White. A new bound for Erdős' minimum overlap problem. Acta Arithmetica, 208:235–255, 2023.