pith. machine review for the scientific record.

arxiv: 2605.07039 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: 2 theorem links

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

Beidou Wang, Benjamin Coleman, Bo Peng, Derek Zhiyuan Cheng, Ed H. Chi, Minghao Yan, Noveen Sachdeva, Shivaram Venkataraman, Shuo Chen, Wang-Cheng Kang, Weili Wang, Zhankui He, Zhouhang Xie, Ziqi Chen

Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3

classification 💻 cs.LG
keywords evolutionary search · test-time adaptation · reinforcement learning · advisor model · phase-adaptive training · non-stationary rewards · large language models · policy learning

The pith

PACEvolve++ lets evolutionary search agents adapt their search policy at test time through phase-adaptive reinforcement learning on an advisor model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard evolutionary search with large language models relies on fixed prompts that cannot adjust to task-specific dynamics when evaluations are costly. PACEvolve++ separates high-level strategy from low-level execution: a trainable advisor model proposes, evaluates, and picks hypotheses while a frontier model converts the chosen ones into concrete candidates. To train the advisor amid shifting rewards, the framework switches from group-relative feedback early (to learn broad preferences) to best-of-k emphasis later (to stabilize refinement once gaps narrow). Experiments across expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation show faster convergence and more stable test-time training than prior evolutionary frameworks.
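
A minimal sketch of that division of labor, assuming hypothetical `advisor` and `frontier` interfaces (`propose`, `score`, `implement`) and an expensive `evaluate` callback; none of these names come from the paper:

```python
def evolve_step(population, advisor, frontier, evaluate, n_hypotheses=8):
    """One evolutionary iteration: the advisor steers, the frontier executes."""
    # 1. The trainable advisor proposes high-level hypotheses conditioned
    #    on the current population of candidates and their scores.
    hypotheses = [advisor.propose(population) for _ in range(n_hypotheses)]

    # 2. The advisor assesses its own hypotheses (e.g., for novelty and
    #    expected payoff) and commits to the most promising one.
    chosen = max(hypotheses, key=advisor.score)

    # 3. The stronger frontier model translates the chosen hypothesis into
    #    a concrete, executable candidate (e.g., code for the task).
    candidate = frontier.implement(chosen, population)

    # 4. The costly task evaluator returns the reward that test-time RL
    #    later uses to update the advisor's policy.
    reward = evaluate(candidate)
    population.append((candidate, reward))
    return reward
```

Only the advisor is trainable in this reading; the frontier model is not updated, which is presumably what keeps test-time adaptation cheap relative to fine-tuning the larger model.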

Core claim

We introduce PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. PACEvolve++ decouples strategic search decisions from implementation: a trainable advisor generates, assesses, and selects hypotheses, while a stronger frontier model translates selected hypotheses into executable candidates. To train the advisor under non-stationary feedback, we propose a phase-adaptive approach that adapts its optimization strategy to different phases of the evolutionary process. Early in evolution, it uses group-relative feedback to learn broad search preferences; later, as reward gaps compress, it emphasizes best-of-k frontier contribution to support stable refinement.

What carries the argument

The phase-adaptive training strategy that begins with group-relative feedback for broad preferences and later shifts to best-of-k emphasis to maintain stability as reward signals become non-stationary.
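
The abstract does not give the exact credit-assignment formulas, so the sketch below is one plausible reading: a GRPO-style group-normalized advantage early, and a top-k-only advantage late, with `early_phase` and `k` as assumed knobs.

```python
import numpy as np

def phase_adaptive_advantages(rewards, early_phase, k=4):
    """Group-relative credit early in evolution; best-of-k emphasis late."""
    rewards = np.asarray(rewards, dtype=float)
    if early_phase:
        # Group-relative feedback: every rollout is credited by how it
        # compares to its group, which teaches broad search preferences.
        return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Late phase: reward gaps have compressed, so credit only the top-k
    # rollouts, relative to each other, to keep refinement stable.
    adv = np.zeros_like(rewards)
    topk = np.argsort(rewards)[-k:]
    adv[topk] = rewards[topk] - rewards[topk].mean()
    return adv
```

Under group normalization, a near-uniform late-stage reward vector still produces large, noisy advantages because the variance in the denominator shrinks; concentrating credit on the best k rollouts is one way to avoid that instability.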

If this is right

  • Faster convergence on tasks where each candidate evaluation is expensive.
  • Stabilized test-time training throughout the evolutionary process.
  • Outperformance of prior frontier-model evolutionary search baselines on expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation.
  • Reduced need for manual hyperparameter tuning when reward distributions change over the course of search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same phase-adaptive logic could be applied to other online optimization loops that face gradually compressing reward signals, such as iterative code improvement or molecular design.
  • Decoupling strategy (advisor) from execution (frontier model) may allow smaller models to steer search while still benefiting from occasional high-quality execution steps.
  • If the approach generalizes beyond the three domains, it suggests that test-time RL on search policies can serve as a lightweight alternative to full fine-tuning for specialized tasks.
  • A natural next test would be to measure whether the learned advisor policy transfers across related but distinct search tasks without retraining.

Load-bearing premise

The phase-adaptive training strategy successfully handles non-stationary reward signals without introducing instability or requiring task-specific hyperparameter tuning.

What would settle it

A direct comparison on any of the three evaluated tasks in which PACEvolve++ either converges more slowly than the prior state-of-the-art evolutionary framework or exhibits clear training instability once reward gaps compress.

Figures

Figures reproduced from arXiv: 2605.07039 by Beidou Wang, Benjamin Coleman, Bo Peng, Derek Zhiyuan Cheng, Ed H. Chi, Minghao Yan, Noveen Sachdeva, Shivaram Venkataraman, Shuo Chen, Wang-Cheng Kang, Weili Wang, Zhankui He, Zhouhang Xie, Ziqi Chen.

Figure 1. Overall PACEvolve++ workflow. A trainable advisor handles idea generation, novelty assessment, and hypothesis selection, while a frontier implementation model writes code. The RL objective is coupled to rollout batches and adapts its credit assignment to the search phase. We introduce a dedicated advisor model [3] to make search-specific policy adaptation explicit. The advisor learns the strategic decis…

Figure 2. Training dynamics for DeepSeek-R1-0528-Qwen3-8B on Multi-Evolve. PACEvolve++ …

Figure 3. Comparison of different RL algorithms on DeepSeek-R1-0528-Qwen3-8B across three …

Figure 4. Comparison of different RL algorithms on Qwen3.5-4B across three tasks. PACEvolve++ …

Figure 5. Multi-Evolve training dynamics of 8B models. ThetaEvolve exhibits large gradient-norm …

Figure 6. Multi-Evolve training dynamics of 4B models. PACEvolve++ remains the most stable on …

Figure 7. KuaiRec training dynamics of 8B models. ThetaEvolve exhibits large gradient-norm spikes, …

Figure 8. KuaiRec training dynamics of 4B models. ThetaEvolve exhibits large gradient-norm spikes, …

Figure 9. EPLB training dynamics of 8B models. ThetaEvolve exhibits large gradient-norm spikes, …

Figure 10. EPLB training dynamics of 4B models. ThetaEvolve exhibits large gradient-norm spikes, …
Original abstract

Large language models have become drivers of evolutionary search, but most systems rely on a fixed, prompt-elicited policy to sample next candidates. This limits adaptation in practical engineering and research tasks, where evaluations are expensive, and progress depends on learning task-specific search dynamics. We introduce PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. PACEvolve++ decouples strategic search decisions from implementation: a trainable advisor generates, assesses, and selects hypotheses, while a stronger frontier model translates selected hypotheses into executable candidates. To train the advisor under non-stationary feedback, we propose a phase-adaptive approach that adapts its optimization strategy to different phases of the evolutionary process. Early in evolution, it uses group-relative feedback to learn broad search preferences; later, as reward gaps compress, it emphasizes best-of-$k$ frontier contribution to support stable refinement. Across expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation, PACEvolve++ outperforms the state-of-the-art evolutionary search framework with frontier models, achieving faster convergence and stabilizing test-time training during evolutionary search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. It decouples a trainable advisor model (for hypothesis generation, assessment, and selection) from a stronger frontier model (for translating hypotheses into executable candidates). A phase-adaptive training strategy is proposed that switches from group-relative feedback early in evolution to best-of-k emphasis later to manage non-stationary rewards. The authors claim that PACEvolve++ outperforms state-of-the-art evolutionary search frameworks with frontier models on expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation, achieving faster convergence and stabilizing test-time training.

Significance. If the performance gains are shown to be robust through controlled experiments, this work could meaningfully advance test-time adaptation techniques for LLM-driven evolutionary search in domains with costly evaluations. The advisor-frontier decoupling is a clean architectural choice that separates strategic learning from execution and may generalize beyond the reported tasks. The focus on handling non-stationary feedback during search is timely for practical engineering and scientific applications.

major comments (3)
  1. [Abstract] The central claims of outperformance, faster convergence, and stabilized test-time training are asserted without quantitative metrics, baseline names, effect sizes, or statistical tests, preventing evaluation of whether the data support the headline results.
  2. [Experiments] No ablation is presented that holds the advisor-frontier decoupling and compute budget fixed while varying only the phase-adaptive schedule (group-relative early vs. best-of-k later); without this isolation, it is impossible to attribute stability or gains specifically to the phase-adaptive strategy rather than to other design elements.
  3. [Method] The phase-adaptive approach claims to avoid task-specific hyperparameter tuning, yet the manuscript provides no details on the phase-transition criterion (e.g., reward-gap threshold, iteration count, or variance-based trigger) or sensitivity analysis across the three domains.
minor comments (1)
  1. [Method] The term 'best-of-k frontier contribution' is used without an accompanying equation or pseudocode clarifying how the k candidates are selected and how their contribution is incorporated into the advisor's loss.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight opportunities to strengthen the presentation of results and methodological transparency. We address each major comment below and will incorporate revisions to improve the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claims of outperformance, faster convergence, and stabilized test-time training are asserted without quantitative metrics, baseline names, effect sizes, or statistical tests, preventing evaluation of whether the data support the headline results.

    Authors: We agree that the abstract would be strengthened by including quantitative support for the claims. In the revised version, we will update the abstract to report key metrics such as convergence speed improvements (e.g., iterations to reach target performance), final performance gains over named baselines, effect sizes, and any statistical significance tests across the three domains. revision: yes

  2. Referee: [Experiments] No ablation is presented that holds the advisor-frontier decoupling and compute budget fixed while varying only the phase-adaptive schedule (group-relative early vs. best-of-k later); without this isolation, it is impossible to attribute stability or gains specifically to the phase-adaptive strategy rather than to other design elements.

    Authors: This is a valid concern. While the current experiments compare the complete PACEvolve++ framework against baselines, they do not isolate the phase-adaptive schedule. We will add a controlled ablation in the revised manuscript that keeps the advisor-frontier decoupling and total compute budget fixed, varying only the training schedule to directly demonstrate its contribution to stability and performance gains. revision: yes

  3. Referee: [Method] The phase-adaptive approach claims to avoid task-specific hyperparameter tuning, yet the manuscript provides no details on the phase-transition criterion (e.g., reward-gap threshold, iteration count, or variance-based trigger) or sensitivity analysis across the three domains.

    Authors: We acknowledge the need for greater detail here. The current manuscript describes the high-level switch from group-relative to best-of-k emphasis but omits the exact transition rule and sensitivity checks. In the revision, we will specify the phase-transition criterion (including the precise trigger used, such as a reward variance threshold) and add sensitivity analysis results showing robustness of the chosen transition point across the load balancing, recommendation, and protein tasks. revision: yes
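
For concreteness, one hypothetical form such a trigger could take (this rule is illustrative, not taken from the manuscript): switch once the recent spread of rewards falls well below its early-search level.

```python
import numpy as np

def should_switch_phase(reward_history, window=20, rel_threshold=0.1):
    """Hypothetical variance-based phase trigger; not the paper's rule."""
    if len(reward_history) < 2 * window:
        return False  # not enough evidence about reward spread yet
    early = np.asarray(reward_history[:window], dtype=float)
    recent = np.asarray(reward_history[-window:], dtype=float)
    # Reward gaps have "compressed" when the recent spread is a small
    # fraction of the spread observed early in the search.
    return recent.std() < rel_threshold * (early.std() + 1e-8)
```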

Circularity Check

0 steps flagged

No circularity: purely empirical claims without derivations

full rationale

The paper presents PACEvolve++ as an empirical RL framework for test-time adaptation, with performance claims based on experimental comparisons to external baselines across three domains. No equations, first-principles derivations, or predictions are offered that could reduce to fitted inputs or self-definitions by construction. The phase-adaptive strategy is described as a design choice whose contribution is evaluated via overall results rather than isolated as a tautological output. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in a way that collapses the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no mathematical derivations, fitted constants, or newly postulated entities. It relies on standard assumptions from reinforcement learning and large-language-model capabilities that are treated as given from prior literature.

pith-pipeline@v0.9.0 · 5544 in / 1174 out tokens · 67640 ms · 2026-05-11T00:50:14.911863+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

58 extracted references · 40 canonical work pages · 12 internal anchors

  1. [1] Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025.
  2. [2] Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake Aaron Richards, Rob Fergus, and Kenneth Marino. Language agents mirror human causal reasoning biases. How can we help them think like scientists? In Second Conference on Language Modeling, 2025.
  3. [3] Parth Asawa, Alan Zhu, Abby O’Neill, Matei Zaharia, Alexandros G Dimakis, and Joseph E Gonzalez. How to train your advisor: Steering black-box LLMs with advisor models. arXiv preprint arXiv:2510.02453, 2025.
  4. [4] Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai. CodeEvolve: An open source evolutionary coding agent for algorithm discovery and optimization. arXiv preprint arXiv:2510.14150, 2025.
  5. [5] Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, et al. AdaEvolve: Adaptive LLM-driven zeroth-order optimization. arXiv preprint arXiv:2602.20133, 2026.
  6. [6] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024.
  7. [7] Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025.
  8. [8] Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, et al. Barbarians at the gate: How AI is upending systems research. arXiv preprint arXiv:2510.06189, 2025.
  9. [9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
  10. [10] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, 2026.
  11. [11] David B Fogel. An evolutionary approach to the traveling salesman problem. Biological Cybernetics, 60(2):139–144, 1988.
  12. [12] Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei Efros. Test-time training with masked autoencoders. Advances in Neural Information Processing Systems, 35:29374–29385, 2022.
  13. [13] Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. KuaiRec: A fully-observed dataset and insights for evaluating recommender systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 540–550, 2022.
  14. [14] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.
  15. [15] Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, and Lizhu Zhang. EBPO: Empirical Bayes shrinkage for stabilizing group-relative policy optimization. arXiv preprint arXiv:2602.05165, 2026.
  16. [16] John H Holland. Genetic algorithms. Scientific American, 267(1):66–73, 1992.
  17. [17] Gregory Hornby, Al Globus, Derek Linden, and Jason Lohn. Automated antenna design with evolutionary algorithms. In Space 2006, page 7242, 2006.
  18. [18] Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, Yu Yue, Qianchuan Zhao, and Lin Yan. Risk-sensitive RL for alleviating exploration dilemmas in large language models. arXiv preprint arXiv:2509.24261, 2025.
  19. [19] Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. ShinkaEvolve: Towards open-ended and sample-efficient program evolution. arXiv preprint arXiv:2509.19349, 2025.
  20. [20] Johannes Lengler. Drift analysis. In Theory of Evolutionary Computation: Recent Developments in Discrete Optimization, pages 89–131. Springer, 2019.
  21. [21] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
  22. [22] Fei Liu, Qingfu Zhang, Jialong Shi, Xialiang Tong, Kun Mao, and Mingxuan Yuan. Fitness landscape of large language model-assisted automated algorithm search. arXiv preprint arXiv:2504.19636, 2025.
  23. [23] Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z Pan, et al. Evox: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026.
  24. [24] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
  25. [25] Allen Nie, Yi Su, Bo Chang, Jonathan N Lee, Ed H Chi, Quoc V Le, and Minmin Chen. EVOLvE: Evaluating and optimizing LLMs for exploration. arXiv preprint arXiv:2410.06238, 2024.
  26. [26] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025.
  27. [27] Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, and Daniil Gavrilov. F-GRPO: Don’t let your policy learn the obvious and forget the rare. arXiv preprint arXiv:2602.06717, 2026.
  28. [28] Rushi Qiang, Yuchen Zhuang, Anikait Singh, Percy Liang, Chao Zhang, Sherry Yang, and Bo Dai. MLE-Smith: Scaling MLE tasks with automated multi-agent pipeline. arXiv preprint arXiv:2510.07307, 2025.
  29. [29] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024.
  30. [30] Philip A Romero and Frances H Arnold. Exploring protein fitness landscapes by directed evolution. Nature Reviews Molecular Cell Biology, 10(12):866–876, 2009.
  31. [31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  32. [32] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
  33. [33] Asankhaya Sharma. OpenEvolve: An open-source evolutionary coding agent, 2025.
  34. [34] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  35. [35] Parshin Shojaee, Kazem Meidani, Shashank Gupta, Amir Barati Farimani, and Chandan K Reddy. LLM-SR: Scientific equation discovery via programming with large language models. arXiv preprint arXiv:2404.18400, 2024.
  36. [36] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): RNNs with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024.
  37. [37] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248. PMLR, 2020.
  38. [38] Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, et al. End-to-end test-time training for long context. arXiv preprint arXiv:2512.23675, 2025.
  39. [39] Gemini 3 Team. Gemini 3, Nov 2025.
  40. [40] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
  41. [41] Vincent Q Tran, Matthew Nemeth, Liam J Bartie, Sita S Chandrasekaran, Alison Fanton, Hyungseok C Moon, Brian L Hie, Silvana Konermann, and Patrick D Hsu. Rapid directed evolution guided by protein language models and epistatic interactions. Science, page eaea1820, 2026.
  42. [42] Christian Walder and Deep Karkhanis. Pass@k policy optimization: Solving harder reinforcement learning problems. arXiv preprint arXiv:2505.15201, 2025.
  43. [43] Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021, pages 1785–1797, 2021.
  44. [44] Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. ThetaEvolve: Test-time learning on open problems. arXiv preprint arXiv:2511.23473, 2025.
  45. [45] Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Shuo Chen, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, et al. PACEvolve: Enabling long-horizon progress-aware consistent evolution. arXiv preprint arXiv:2601.10657, 2026.
  46. [46] John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, and Ofir Press. ProgramBench: Can language models rebuild programs from scratch?, 2026.
  47. [47] Sherry Yang, Joy He-Yueya, and Percy Liang. Reinforcement learning for machine learning engineering agents. arXiv preprint arXiv:2509.01684, 2025.
  48. [48] Yufei Ye, Wei Guo, Hao Wang, Luankang Zhang, Heng Chang, Hong Zhu, Yuyang Ye, Yong Liu, Defu Lian, and Enhong Chen. Fuxi-linear: Unleashing the power of linear attention in long-term time-aware sequential recommendation. arXiv preprint arXiv:2602.23671, 2026.
  49. [49] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
  50. [50] Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time. arXiv preprint arXiv:2601.16175, 2026.
  51. [51] Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.
  52. [52] Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Michael He, et al. Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations. arXiv preprint arXiv:2402.17152, 2024.
  53. [53] Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Daifeng Guo, Yanli Zhao, Shen Li, Yuchen Hao, Yantao Yao, et al. Wukong: Towards a scaling law for large-scale recommendation. arXiv preprint arXiv:2403.02545, 2024.
  54. [54] Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Ying Yang, Shijin Gong, and Chengchun Shi. Demystifying group relative policy optimization: Its policy gradient is a U-statistic. arXiv preprint arXiv:2603.01162, 2026.
  55. [55] Jieming Zhu, Quanyu Dai, Liangcai Su, Rong Ma, Jinyang Liu, Guohao Cai, Xi Xiao, and Rui Zhang. BARS: Towards open benchmarking for recommender systems. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2912–2923, 2022.
  56. [56] Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, and Xiuqiang He. Open benchmarking for click-through rate prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 2759–2769, 2021.
  57. [57] Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al. Where LLM agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370, 2025.
  58. [58] Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025.