pith. machine review for the scientific record.

arxiv: 2605.07039 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: 2 theorem links

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

Beidou Wang, Benjamin Coleman, Bo Peng, Derek Zhiyuan Cheng, Ed H. Chi, Minghao Yan, Noveen Sachdeva, Shivaram Venkataraman, Shuo Chen, Wang-Cheng Kang, Weili Wang, Zhankui He, Zhouhang Xie, Ziqi Chen

Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3

classification 💻 cs.LG
keywords evolutionary search · test-time adaptation · reinforcement learning · advisor model · phase-adaptive training · non-stationary rewards · large language models · policy learning

The pith

PACEvolve++ lets evolutionary search agents adapt their search policy at test time through phase-adaptive reinforcement learning on an advisor model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard evolutionary search with large language models relies on fixed prompts that cannot adjust to task-specific dynamics when evaluations are costly. PACEvolve++ separates high-level strategy from low-level execution: a trainable advisor model proposes, evaluates, and picks hypotheses while a frontier model converts the chosen ones into concrete candidates. To train the advisor amid shifting rewards, the framework switches from group-relative feedback early (to learn broad preferences) to best-of-k emphasis later (to stabilize refinement once gaps narrow). Experiments across expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation show faster convergence and more stable test-time training than prior evolutionary frameworks.
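
A minimal sketch of that division of labor, assuming hypothetical `advisor` and `frontier` interfaces (`propose`, `score`, `implement`) and an expensive `evaluate` callback; none of these names come from the paper:

```python
def evolve_step(population, advisor, frontier, evaluate, n_hypotheses=8):
    """One evolutionary iteration: the advisor steers, the frontier executes."""
    # 1. The trainable advisor proposes high-level hypotheses conditioned
    #    on the current population of candidates and their scores.
    hypotheses = [advisor.propose(population) for _ in range(n_hypotheses)]

    # 2. The advisor assesses its own hypotheses (e.g., for novelty and
    #    expected payoff) and commits to the most promising one.
    chosen = max(hypotheses, key=advisor.score)

    # 3. The stronger frontier model translates the chosen hypothesis into
    #    a concrete, executable candidate (e.g., code for the task).
    candidate = frontier.implement(chosen, population)

    # 4. The costly task evaluator returns the reward that test-time RL
    #    later uses to update the advisor's policy.
    reward = evaluate(candidate)
    population.append((candidate, reward))
    return reward
```

Only the advisor is trainable in this reading; the frontier model is not updated, which is presumably what keeps test-time adaptation cheap relative to fine-tuning the larger model.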

Core claim

We introduce PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. PACEvolve++ decouples strategic search decisions from implementation: a trainable advisor generates, assesses, and selects hypotheses, while a stronger frontier model translates selected hypotheses into executable candidates. To train the advisor under non-stationary feedback, we propose a phase-adaptive approach that adapts its optimization strategy to different phases of the evolutionary process. Early in evolution, it uses group-relative feedback to learn broad search preferences; later, as reward gaps compress, it emphasizes best-of-k frontier contribution to support stable refinement.

What carries the argument

The phase-adaptive training strategy that begins with group-relative feedback for broad preferences and later shifts to best-of-k emphasis to maintain stability as reward signals become non-stationary.
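
The abstract does not give the exact credit-assignment formulas, so the sketch below is one plausible reading: a GRPO-style group-normalized advantage early, and a top-k-only advantage late, with `early_phase` and `k` as assumed knobs.

```python
import numpy as np

def phase_adaptive_advantages(rewards, early_phase, k=4):
    """Group-relative credit early in evolution; best-of-k emphasis late."""
    rewards = np.asarray(rewards, dtype=float)
    if early_phase:
        # Group-relative feedback: every rollout is credited by how it
        # compares to its group, which teaches broad search preferences.
        return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Late phase: reward gaps have compressed, so credit only the top-k
    # rollouts, relative to each other, to keep refinement stable.
    adv = np.zeros_like(rewards)
    topk = np.argsort(rewards)[-k:]
    adv[topk] = rewards[topk] - rewards[topk].mean()
    return adv
```

Under group normalization, a near-uniform late-stage reward vector still produces large, noisy advantages because the variance in the denominator shrinks; concentrating credit on the best k rollouts is one way to avoid that instability.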

If this is right

  • Faster convergence on tasks where each candidate evaluation is expensive.
  • Stabilized test-time training throughout the evolutionary process.
  • Outperformance of prior frontier-model evolutionary search baselines on expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation.
  • Reduced need for manual hyperparameter tuning when reward distributions change over the course of search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same phase-adaptive logic could be applied to other online optimization loops that face gradually compressing reward signals, such as iterative code improvement or molecular design.
  • Decoupling strategy (advisor) from execution (frontier model) may allow smaller models to steer search while still benefiting from occasional high-quality execution steps.
  • If the approach generalizes beyond the three domains, it suggests that test-time RL on search policies can serve as a lightweight alternative to full fine-tuning for specialized tasks.
  • A natural next test would be to measure whether the learned advisor policy transfers across related but distinct search tasks without retraining.

Load-bearing premise

The phase-adaptive training strategy successfully handles non-stationary reward signals without introducing instability or requiring task-specific hyperparameter tuning.

What would settle it

A direct comparison on any of the three evaluated tasks in which PACEvolve++ either converges more slowly than the prior state-of-the-art evolutionary framework or exhibits clear training instability once reward gaps compress.

Figures

Figures reproduced from arXiv: 2605.07039 by Beidou Wang, Benjamin Coleman, Bo Peng, Derek Zhiyuan Cheng, Ed H. Chi, Minghao Yan, Noveen Sachdeva, Shivaram Venkataraman, Shuo Chen, Wang-Cheng Kang, Weili Wang, Zhankui He, Zhouhang Xie, Ziqi Chen.

Figure 1. Overall PACEvolve++ workflow. A trainable advisor handles idea generation, novelty assessment, and hypothesis selection, while a frontier implementation model writes code. The RL objective is coupled to rollout batches and adapts its credit assignment to the search phase. We introduce a dedicated advisor model [3] to make search-specific policy adaptation explicit. The advisor learns the strategic decis…

Figure 2. Training dynamics for DeepSeek-R1-0528-Qwen3-8B on Multi-Evolve. PACEvolve++ …

Figure 3. Comparison of different RL algorithms on DeepSeek-R1-0528-Qwen3-8B across three …

Figure 4. Comparison of different RL algorithms on Qwen3.5-4B across three tasks. PACEvolve++ …

Figure 5. Multi-Evolve training dynamics of 8B models. ThetaEvolve exhibits large gradient-norm …

Figure 6. Multi-Evolve training dynamics of 4B models. PACEvolve++ remains the most stable on …

Figure 7. KuaiRec training dynamics of 8B models. ThetaEvolve exhibits large gradient-norm spikes, …

Figure 8. KuaiRec training dynamics of 4B models. ThetaEvolve exhibits large gradient-norm spikes, …

Figure 9. EPLB training dynamics of 8B models. ThetaEvolve exhibits large gradient-norm spikes, …

Figure 10. EPLB training dynamics of 4B models. ThetaEvolve exhibits large gradient-norm spikes, …
Original abstract

Large language models have become drivers of evolutionary search, but most systems rely on a fixed, prompt-elicited policy to sample next candidates. This limits adaptation in practical engineering and research tasks, where evaluations are expensive, and progress depends on learning task-specific search dynamics. We introduce PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. PACEvolve++ decouples strategic search decisions from implementation: a trainable advisor generates, assesses, and selects hypotheses, while a stronger frontier model translates selected hypotheses into executable candidates. To train the advisor under non-stationary feedback, we propose a phase-adaptive approach that adapts its optimization strategy to different phases of the evolutionary process. Early in evolution, it uses group-relative feedback to learn broad search preferences; later, as reward gaps compress, it emphasizes best-of-$k$ frontier contribution to support stable refinement. Across expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation, PACEvolve++ outperforms the state-of-the-art evolutionary search framework with frontier models, achieving faster convergence and stabilizing test-time training during evolutionary search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. It decouples a trainable advisor model (for hypothesis generation, assessment, and selection) from a stronger frontier model (for translating hypotheses into executable candidates). A phase-adaptive training strategy is proposed that switches from group-relative feedback early in evolution to best-of-k emphasis later to manage non-stationary rewards. The authors claim that PACEvolve++ outperforms state-of-the-art evolutionary search frameworks with frontier models on expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation, achieving faster convergence and stabilizing test-time training.

Significance. If the performance gains are shown to be robust through controlled experiments, this work could meaningfully advance test-time adaptation techniques for LLM-driven evolutionary search in domains with costly evaluations. The advisor-frontier decoupling is a clean architectural choice that separates strategic learning from execution and may generalize beyond the reported tasks. The focus on handling non-stationary feedback during search is timely for practical engineering and scientific applications.

major comments (3)
  1. [Abstract] The central claims of outperformance, faster convergence, and stabilized test-time training are asserted without quantitative metrics, baseline names, effect sizes, or statistical tests, preventing evaluation of whether the data support the headline results.
  2. [Experiments] No ablation is presented that holds the advisor-frontier decoupling and compute budget fixed while varying only the phase-adaptive schedule (group-relative early vs. best-of-k later); without this isolation, it is impossible to attribute stability or gains specifically to the phase-adaptive strategy rather than to other design elements.
  3. [Method] The phase-adaptive approach claims to avoid task-specific hyperparameter tuning, yet the manuscript provides no details on the phase-transition criterion (e.g., reward-gap threshold, iteration count, or variance-based trigger) or sensitivity analysis across the three domains.
minor comments (1)
  1. [Method] The term 'best-of-k frontier contribution' is used without an accompanying equation or pseudocode clarifying how the k candidates are selected and how their contribution is incorporated into the advisor's loss.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight opportunities to strengthen the presentation of results and methodological transparency. We address each major comment below and will incorporate revisions to improve the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claims of outperformance, faster convergence, and stabilized test-time training are asserted without quantitative metrics, baseline names, effect sizes, or statistical tests, preventing evaluation of whether the data support the headline results.

    Authors: We agree that the abstract would be strengthened by including quantitative support for the claims. In the revised version, we will update the abstract to report key metrics such as convergence speed improvements (e.g., iterations to reach target performance), final performance gains over named baselines, effect sizes, and any statistical significance tests across the three domains. revision: yes

  2. Referee: [Experiments] No ablation is presented that holds the advisor-frontier decoupling and compute budget fixed while varying only the phase-adaptive schedule (group-relative early vs. best-of-k later); without this isolation, it is impossible to attribute stability or gains specifically to the phase-adaptive strategy rather than to other design elements.

    Authors: This is a valid concern. While the current experiments compare the complete PACEvolve++ framework against baselines, they do not isolate the phase-adaptive schedule. We will add a controlled ablation in the revised manuscript that keeps the advisor-frontier decoupling and total compute budget fixed, varying only the training schedule to directly demonstrate its contribution to stability and performance gains. revision: yes

  3. Referee: [Method] The phase-adaptive approach claims to avoid task-specific hyperparameter tuning, yet the manuscript provides no details on the phase-transition criterion (e.g., reward-gap threshold, iteration count, or variance-based trigger) or sensitivity analysis across the three domains.

    Authors: We acknowledge the need for greater detail here. The current manuscript describes the high-level switch from group-relative to best-of-k emphasis but omits the exact transition rule and sensitivity checks. In the revision, we will specify the phase-transition criterion (including the precise trigger used, such as a reward variance threshold) and add sensitivity analysis results showing robustness of the chosen transition point across the load balancing, recommendation, and protein tasks. revision: yes
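
For concreteness, one hypothetical form such a trigger could take (this rule is illustrative, not taken from the manuscript): switch once the recent spread of rewards falls well below its early-search level.

```python
import numpy as np

def should_switch_phase(reward_history, window=20, rel_threshold=0.1):
    """Hypothetical variance-based phase trigger; not the paper's rule."""
    if len(reward_history) < 2 * window:
        return False  # not enough evidence about reward spread yet
    early = np.asarray(reward_history[:window], dtype=float)
    recent = np.asarray(reward_history[-window:], dtype=float)
    # Reward gaps have "compressed" when the recent spread is a small
    # fraction of the spread observed early in the search.
    return recent.std() < rel_threshold * (early.std() + 1e-8)
```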

Circularity Check

0 steps flagged

No circularity: purely empirical claims without derivations

full rationale

The paper presents PACEvolve++ as an empirical RL framework for test-time adaptation, with performance claims based on experimental comparisons to external baselines across three domains. No equations, first-principles derivations, or predictions are offered that could reduce to fitted inputs or self-definitions by construction. The phase-adaptive strategy is described as a design choice whose contribution is evaluated via overall results rather than isolated as a tautological output. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in a way that collapses the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no mathematical derivations, fitted constants, or newly postulated entities. It relies on standard assumptions from reinforcement learning and large-language-model capabilities that are treated as given from prior literature.

pith-pipeline@v0.9.0 · 5544 in / 1174 out tokens · 67640 ms · 2026-05-11T00:50:14.911863+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

58 extracted references · 40 canonical work pages · 12 internal anchors

  1. [1] Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025.
  2. [2] Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake Aaron Richards, Rob Fergus, and Kenneth Marino. Language agents mirror human causal reasoning biases. How can we help them think like scientists? In Second Conference on Language Modeling, 2025.
  3. [3] Parth Asawa, Alan Zhu, Abby O’Neill, Matei Zaharia, Alexandros G Dimakis, and Joseph E Gonzalez. How to train your advisor: Steering black-box LLMs with advisor models. arXiv preprint arXiv:2510.02453, 2025.
  4. [4] Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai. CodeEvolve: An open source evolutionary coding agent for algorithm discovery and optimization. arXiv preprint arXiv:2510.14150, 2025.
  5. [5] Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, et al. AdaEvolve: Adaptive LLM-driven zeroth-order optimization. arXiv preprint arXiv:2602.20133, 2026.
  6. [6] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024.
  7. [7] Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025.
  8. [8] Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, et al. Barbarians at the gate: How AI is upending systems research. arXiv preprint arXiv:2510.06189, 2025.
  9. [9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
  10. [10] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, 2026.
  11. [11] David B Fogel. An evolutionary approach to the traveling salesman problem. Biological Cybernetics, 60(2):139–144, 1988.
  12. [12] Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei Efros. Test-time training with masked autoencoders. Advances in Neural Information Processing Systems, 35:29374–29385, 2022.
  13. [13] Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. KuaiRec: A fully-observed dataset and insights for evaluating recommender systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 540–550, 2022.
  14. [14] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.
  15. [15] Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, and Lizhu Zhang. EBPO: Empirical Bayes shrinkage for stabilizing group-relative policy optimization. arXiv preprint arXiv:2602.05165, 2026.
  16. [16] John H Holland. Genetic algorithms. Scientific American, 267(1):66–73, 1992.
  17. [17] Gregory Hornby, Al Globus, Derek Linden, and Jason Lohn. Automated antenna design with evolutionary algorithms. In Space 2006, page 7242, 2006.
  18. [18] Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, Yu Yue, Qianchuan Zhao, and Lin Yan. Risk-sensitive RL for alleviating exploration dilemmas in large language models. arXiv preprint arXiv:2509.24261, 2025.
  19. [19] Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. ShinkaEvolve: Towards open-ended and sample-efficient program evolution. arXiv preprint arXiv:2509.19349, 2025.
  20. [20] Johannes Lengler. Drift analysis. In Theory of Evolutionary Computation: Recent Developments in Discrete Optimization, pages 89–131. Springer, 2019.
  21. [21] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
  22. [22] Fei Liu, Qingfu Zhang, Jialong Shi, Xialiang Tong, Kun Mao, and Mingxuan Yuan. Fitness landscape of large language model-assisted automated algorithm search. arXiv preprint arXiv:2504.19636, 2025.
  23. [23] Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z Pan, et al. Evox: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026.
  24. [24] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
  25. [25] Allen Nie, Yi Su, Bo Chang, Jonathan N Lee, Ed H Chi, Quoc V Le, and Minmin Chen. EVOLvE: Evaluating and optimizing LLMs for exploration. arXiv preprint arXiv:2410.06238, 2024.
  26. [26] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025.
  27. [27] Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, and Daniil Gavrilov. F-GRPO: Don’t let your policy learn the obvious and forget the rare. arXiv preprint arXiv:2602.06717, 2026.
  28. [28] Rushi Qiang, Yuchen Zhuang, Anikait Singh, Percy Liang, Chao Zhang, Sherry Yang, and Bo Dai. MLE-Smith: Scaling MLE tasks with automated multi-agent pipeline. arXiv preprint arXiv:2510.07307, 2025.
  29. [29] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024.
  30. [30] Philip A Romero and Frances H Arnold. Exploring protein fitness landscapes by directed evolution. Nature Reviews Molecular Cell Biology, 10(12):866–876, 2009.
  31. [31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  32. [32] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
  33. [33] Asankhaya Sharma. OpenEvolve: An open-source evolutionary coding agent, 2025.
  34. [34] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  35. [35] Parshin Shojaee, Kazem Meidani, Shashank Gupta, Amir Barati Farimani, and Chandan K Reddy. LLM-SR: Scientific equation discovery via programming with large language models. arXiv preprint arXiv:2404.18400, 2024.
  36. [36] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): RNNs with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024.
  37. [37] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248. PMLR, 2020.
  38. [38] Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, et al. End-to-end test-time training for long context. arXiv preprint arXiv:2512.23675, 2025.
  39. [39] Gemini 3 Team. Gemini 3, Nov 2025.
  40. [40] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
  41. [41] Vincent Q Tran, Matthew Nemeth, Liam J Bartie, Sita S Chandrasekaran, Alison Fanton, Hyungseok C Moon, Brian L Hie, Silvana Konermann, and Patrick D Hsu. Rapid directed evolution guided by protein language models and epistatic interactions. Science, page eaea1820, 2026.
  42. [42] Christian Walder and Deep Karkhanis. Pass@k policy optimization: Solving harder reinforcement learning problems. arXiv preprint arXiv:2505.15201, 2025.
  43. [43] Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021, pages 1785–1797, 2021.
  44. [44] Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. ThetaEvolve: Test-time learning on open problems. arXiv preprint arXiv:2511.23473, 2025.
  45. [45] Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Shuo Chen, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, et al. PACEvolve: Enabling long-horizon progress-aware consistent evolution. arXiv preprint arXiv:2601.10657, 2026.
  46. [46] John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, and Ofir Press. ProgramBench: Can language models rebuild programs from scratch?, 2026.
  47. [47] Sherry Yang, Joy He-Yueya, and Percy Liang. Reinforcement learning for machine learning engineering agents. arXiv preprint arXiv:2509.01684, 2025.
  48. [48] Yufei Ye, Wei Guo, Hao Wang, Luankang Zhang, Heng Chang, Hong Zhu, Yuyang Ye, Yong Liu, Defu Lian, and Enhong Chen. Fuxi-linear: Unleashing the power of linear attention in long-term time-aware sequential recommendation. arXiv preprint arXiv:2602.23671, 2026.
  49. [49] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
  50. [50] Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time. arXiv preprint arXiv:2601.16175, 2026.
  51. [51] Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.
  52. [52] Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Michael He, et al. Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations. arXiv preprint arXiv:2402.17152, 2024.
  53. [53] Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Daifeng Guo, Yanli Zhao, Shen Li, Yuchen Hao, Yantao Yao, et al. Wukong: Towards a scaling law for large-scale recommendation. arXiv preprint arXiv:2403.02545, 2024.
  54. [54] Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Ying Yang, Shijin Gong, and Chengchun Shi. Demystifying group relative policy optimization: Its policy gradient is a U-statistic. arXiv preprint arXiv:2603.01162, 2026.
  55. [55] Jieming Zhu, Quanyu Dai, Liangcai Su, Rong Ma, Jinyang Liu, Guohao Cai, Xi Xiao, and Rui Zhang. BARS: Towards open benchmarking for recommender systems. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2912–2923, 2022.
  56. [56] Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, and Xiuqiang He. Open benchmarking for click-through rate prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 2759–2769, 2021.
  57. [57] Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al. Where LLM agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370, 2025.
  58. [58] Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025.