PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3
The pith
PACEvolve++ lets evolutionary search agents adapt their policy at test time via phase-adaptive reinforcement learning applied to an advisor model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. PACEvolve++ decouples strategic search decisions from implementation: a trainable advisor generates, assesses, and selects hypotheses, while a stronger frontier model translates selected hypotheses into executable candidates. To train the advisor under non-stationary feedback, we propose a phase-adaptive approach that adapts its optimization strategy to different phases of the evolutionary process. Early in evolution, it uses group-relative feedback to learn broad search preferences; later, as reward gaps compress, it emphasizes best-of-k frontier contribution to support stable refinement.
What carries the argument
The phase-adaptive training strategy that begins with group-relative feedback for broad preferences and later shifts to best-of-k emphasis to maintain stability as reward signals become non-stationary.
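The described strategy can be sketched as a blended advantage signal. The linear blend, the normalization, and all names below are illustrative assumptions made for the sketch, since the review only paraphrases the paper's mechanism.

```python
import numpy as np

def phase_adaptive_advantage(rewards, alpha, k=2):
    """Illustrative sketch (not the paper's exact formulation) of a
    phase-adaptive advantage: a group-relative term dominates early
    (alpha near 0) and a best-of-k term dominates late (alpha near 1).
    """
    rewards = np.asarray(rewards, dtype=float)
    # Group-relative credit: reward minus group mean, normalized.
    adv_group = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Best-of-k credit: indicator on the k highest-reward hypotheses.
    adv_topk = np.zeros_like(rewards)
    adv_topk[np.argsort(rewards)[-k:]] = 1.0
    # Linear blend controlled by the phase schedule alpha in [0, 1].
    return (1.0 - alpha) * adv_group + alpha * adv_topk
```

At alpha = 0 this reduces to group-normalized credit over the whole sample; at alpha = 1 only the top-k candidates receive credit, matching the review's description of late-phase best-of-k emphasis.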
If this is right
- Faster convergence on tasks where each candidate evaluation is expensive.
- Stabilized test-time training throughout the evolutionary process.
- Outperformance of prior frontier-model evolutionary search baselines on expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation.
- Reduced need for manual hyperparameter tuning when reward distributions change over the course of search.
Where Pith is reading between the lines
- The same phase-adaptive logic could be applied to other online optimization loops that face gradually compressing reward signals, such as iterative code improvement or molecular design.
- Decoupling strategy (advisor) from execution (frontier model) may allow smaller models to steer search while still benefiting from occasional high-quality execution steps.
- If the approach generalizes beyond the three domains, it suggests that test-time RL on search policies can serve as a lightweight alternative to full fine-tuning for specialized tasks.
- A natural next test would be to measure whether the learned advisor policy transfers across related but distinct search tasks without retraining.
Load-bearing premise
The phase-adaptive training strategy successfully handles non-stationary reward signals without introducing instability or requiring task-specific hyperparameter tuning.
What would settle it
A direct comparison on any of the three evaluated tasks in which PACEvolve++ either converges more slowly than the prior state-of-the-art evolutionary framework or exhibits clear training instability once reward gaps compress.
Original abstract
Large language models have become drivers of evolutionary search, but most systems rely on a fixed, prompt-elicited policy to sample next candidates. This limits adaptation in practical engineering and research tasks, where evaluations are expensive, and progress depends on learning task-specific search dynamics. We introduce PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. PACEvolve++ decouples strategic search decisions from implementation: a trainable advisor generates, assesses, and selects hypotheses, while a stronger frontier model translates selected hypotheses into executable candidates. To train the advisor under non-stationary feedback, we propose a phase-adaptive approach that adapts its optimization strategy to different phases of the evolutionary process. Early in evolution, it uses group-relative feedback to learn broad search preferences; later, as reward gaps compress, it emphasizes best-of-$k$ frontier contribution to support stable refinement. Across expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation, PACEvolve++ outperforms the state-of-the-art evolutionary search framework with frontier models, achieving faster convergence and stabilizing test-time training during evolutionary search.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. It decouples a trainable advisor model (for hypothesis generation, assessment, and selection) from a stronger frontier model (for translating hypotheses into executable candidates). A phase-adaptive training strategy is proposed that switches from group-relative feedback early in evolution to best-of-k emphasis later to manage non-stationary rewards. The authors claim that PACEvolve++ outperforms state-of-the-art evolutionary search frameworks with frontier models on expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation, achieving faster convergence and stabilizing test-time training.
Significance. If the performance gains are shown to be robust through controlled experiments, this work could meaningfully advance test-time adaptation techniques for LLM-driven evolutionary search in domains with costly evaluations. The advisor-frontier decoupling is a clean architectural choice that separates strategic learning from execution and may generalize beyond the reported tasks. The focus on handling non-stationary feedback during search is timely for practical engineering and scientific applications.
major comments (3)
- [Abstract] The central claims of outperformance, faster convergence, and stabilized test-time training are asserted without any quantitative metrics, baseline names, effect sizes, or statistical tests, preventing evaluation of whether the data support the headline results.
- [Experiments] No ablation is presented that holds the advisor-frontier decoupling and compute budget fixed while varying only the phase-adaptive schedule (group-relative early vs. best-of-k later); without this isolation, it is impossible to attribute stability or gains specifically to the phase-adaptive strategy rather than to other design elements.
- [Method] The phase-adaptive approach claims to avoid task-specific hyperparameter tuning, yet the manuscript provides no details on the phase-transition criterion (e.g., reward-gap threshold, iteration count, or variance-based trigger) or sensitivity analysis across the three domains.
minor comments (1)
- [Method] The term 'best-of-k frontier contribution' is used without an accompanying equation or pseudocode clarifying how the k candidates are selected and how their contribution is incorporated into the advisor's loss.
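To make the minor comment concrete, one plausible reading of "best-of-k frontier contribution" is a REINFORCE-style loss restricted to the k highest-reward hypotheses. Both the selection rule and the loss form here are assumptions invented for illustration, not the paper's definition:

```python
def best_of_k_loss(logprobs, rewards, k=2):
    """Hypothetical sketch: keep the k highest-reward hypotheses in a
    group and push up their log-probabilities under the advisor policy.
    Selection rule and loss form are assumptions, not the paper's."""
    ranked = sorted(zip(rewards, logprobs), reverse=True)[:k]
    # Negated because optimizers minimize: higher reward -> raise logprob.
    return -sum(r * lp for r, lp in ranked) / k
```

An equation or pseudocode of this shape in the manuscript would resolve the ambiguity the comment raises.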
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight opportunities to strengthen the presentation of results and methodological transparency. We address each major comment below and will incorporate revisions to improve the manuscript.
Point-by-point responses
-
Referee: [Abstract] The central claims of outperformance, faster convergence, and stabilized test-time training are asserted without any quantitative metrics, baseline names, effect sizes, or statistical tests, preventing evaluation of whether the data support the headline results.
Authors: We agree that the abstract would be strengthened by including quantitative support for the claims. In the revised version, we will update the abstract to report key metrics such as convergence speed improvements (e.g., iterations to reach target performance), final performance gains over named baselines, effect sizes, and any statistical significance tests across the three domains. revision: yes
-
Referee: [Experiments] No ablation is presented that holds the advisor-frontier decoupling and compute budget fixed while varying only the phase-adaptive schedule (group-relative early vs. best-of-k later); without this isolation, it is impossible to attribute stability or gains specifically to the phase-adaptive strategy rather than to other design elements.
Authors: This is a valid concern. While the current experiments compare the complete PACEvolve++ framework against baselines, they do not isolate the phase-adaptive schedule. We will add a controlled ablation in the revised manuscript that keeps the advisor-frontier decoupling and total compute budget fixed, varying only the training schedule to directly demonstrate its contribution to stability and performance gains. revision: yes
-
Referee: [Method] The phase-adaptive approach claims to avoid task-specific hyperparameter tuning, yet the manuscript provides no details on the phase-transition criterion (e.g., reward-gap threshold, iteration count, or variance-based trigger) or sensitivity analysis across the three domains.
Authors: We acknowledge the need for greater detail here. The current manuscript describes the high-level switch from group-relative to best-of-k emphasis but omits the exact transition rule and sensitivity checks. In the revision, we will specify the phase-transition criterion (including the precise trigger used, such as a reward variance threshold) and add sensitivity analysis results showing robustness of the chosen transition point across the load balancing, recommendation, and protein tasks. revision: yes
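The promised transition rule could take many forms. A minimal sketch, assuming a reward-variance trigger of the kind the authors mention (the window size and threshold are invented for illustration; the manuscript does not yet specify its rule):

```python
def phase_weight(reward_history, window=8, var_threshold=0.05):
    """Hypothetical phase-transition rule: shift toward best-of-k
    emphasis (weight 1.0) once recent reward variance compresses below
    a threshold; otherwise stay group-relative (weight 0.0)."""
    if len(reward_history) < window:
        return 0.0  # too early in evolution: group-relative phase
    recent = reward_history[-window:]
    mean = sum(recent) / window
    variance = sum((r - mean) ** 2 for r in recent) / window
    # Compressed reward gaps -> hand credit assignment to best-of-k.
    return 1.0 if variance < var_threshold else 0.0
```

A sensitivity analysis would then sweep `window` and `var_threshold` across the three domains, as the rebuttal proposes.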
Circularity Check
No circularity: purely empirical claims without derivations
Full rationale
The paper presents PACEvolve++ as an empirical RL framework for test-time adaptation, with performance claims based on experimental comparisons to external baselines across three domains. No equations, first-principles derivations, or predictions are offered that could reduce to fitted inputs or self-definitions by construction. The phase-adaptive strategy is described as a design choice whose contribution is evaluated via overall results rather than isolated as a tautological output. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in a way that collapses the central claims.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · matched text: "phase-adaptive approach that adapts its optimization strategy... Early... group-relative feedback... later... best-of-k frontier contribution... Ã^mix_i(t) = (1 − α_t) Ã^G_i + α_t Ã^top-k_i"
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · relevance unclear · matched text: "Theorem 1 (Scale-conditioned credit assignment under reward compression)"
Reference graph
Works this paper leans on
-
[1]
Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025.
-
[2]
GX-Chen Anthony, Dongyan Lin, Mandana Samiei, Doina Precup, Blake Aaron Richards, Rob Fergus, and Kenneth Marino. Language agents mirror human causal reasoning biases. How can we help them think like scientists? In Second Conference on Language Modeling, 2025.
-
[3]
Parth Asawa, Alan Zhu, Abby O’Neill, Matei Zaharia, Alexandros G Dimakis, and Joseph E Gonzalez. How to train your advisor: Steering black-box LLMs with advisor models. arXiv preprint arXiv:2510.02453, 2025.
-
[4]
Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai. CodeEvolve: An open source evolutionary coding agent for algorithm discovery and optimization. arXiv preprint arXiv:2510.14150, 2025.
-
[5]
Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, et al. AdaEvolve: Adaptive LLM driven zeroth-order optimization. arXiv preprint arXiv:2602.20133, 2026.
-
[6]
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024.
-
[7]
Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025.
-
[8]
Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, et al. Barbarians at the gate: How AI is upending systems research. arXiv preprint arXiv:2510.06189, 2025.
-
[9]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
-
[10]
DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, 2026.
-
[11]
David B Fogel. An evolutionary approach to the traveling salesman problem. Biological Cybernetics, 60(2):139–144, 1988.
-
[12]
Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei Efros. Test-time training with masked autoencoders. Advances in Neural Information Processing Systems, 35:29374–29385, 2022.
-
[13]
Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. KuaiRec: A fully-observed dataset and insights for evaluating recommender systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 540–550, 2022.
-
[14]
Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.
-
[15]
Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, and Lizhu Zhang. EBPO: Empirical Bayes shrinkage for stabilizing group-relative policy optimization. arXiv preprint arXiv:2602.05165, 2026.
-
[16]
John H Holland. Genetic algorithms. Scientific American, 267(1):66–73, 1992.
-
[17]
Gregory Hornby, Al Globus, Derek Linden, and Jason Lohn. Automated antenna design with evolutionary algorithms. In Space 2006, page 7242, 2006.
-
[18]
Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, Yu Yue, Qianchuan Zhao, and Lin Yan. Risk-sensitive RL for alleviating exploration dilemmas in large language models. arXiv preprint arXiv:2509.24261, 2025.
-
[19]
Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. ShinkaEvolve: Towards open-ended and sample-efficient program evolution. arXiv preprint arXiv:2509.19349, 2025.
-
[20]
Johannes Lengler. Drift analysis. In Theory of Evolutionary Computation: Recent Developments in Discrete Optimization, pages 89–131. Springer, 2019.
-
[21]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
-
[22]
Fei Liu, Qingfu Zhang, Jialong Shi, Xialiang Tong, Kun Mao, and Mingxuan Yuan. Fitness landscape of large language model-assisted automated algorithm search. arXiv preprint arXiv:2504.19636, 2025.
-
[23]
Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z Pan, et al. EvoX: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026.
-
[24]
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
-
[25]
Allen Nie, Yi Su, Bo Chang, Jonathan N Lee, Ed H Chi, Quoc V Le, and Minmin Chen. Evolve: Evaluating and optimizing LLMs for exploration. arXiv preprint arXiv:2410.06238, 2024.
-
[26]
Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025.
-
[27]
Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, and Daniil Gavrilov. F-GRPO: Don’t let your policy learn the obvious and forget the rare. arXiv preprint arXiv:2602.06717, 2026.
-
[28]
Rushi Qiang, Yuchen Zhuang, Anikait Singh, Percy Liang, Chao Zhang, Sherry Yang, and Bo Dai. MLE-Smith: Scaling MLE tasks with automated multi-agent pipeline. arXiv preprint arXiv:2510.07307, 2025.
-
[29]
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024.
-
[30]
Philip A Romero and Frances H Arnold. Exploring protein fitness landscapes by directed evolution. Nature Reviews Molecular Cell Biology, 10(12):866–876, 2009.
-
[31]
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
-
[32]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
-
[33]
Asankhaya Sharma. OpenEvolve: An open-source evolutionary coding agent, 2025.
-
[34]
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
-
[35]
Parshin Shojaee, Kazem Meidani, Shashank Gupta, Amir Barati Farimani, and Chandan K Reddy. LLM-SR: Scientific equation discovery via programming with large language models. arXiv preprint arXiv:2404.18400, 2024.
-
[36]
Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): RNNs with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024.
-
[37]
Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248. PMLR, 2020.
-
[38]
Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, et al. End-to-end test-time training for long context. arXiv preprint arXiv:2512.23675, 2025.
-
[39]
Gemini 3 Team. Gemini 3, Nov 2025.
-
[40]
Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
-
[41]
Vincent Q Tran, Matthew Nemeth, Liam J Bartie, Sita S Chandrasekaran, Alison Fanton, Hyungseok C Moon, Brian L Hie, Silvana Konermann, and Patrick D Hsu. Rapid directed evolution guided by protein language models and epistatic interactions. Science, page eaea1820, 2026.
-
[42]
Christian Walder and Deep Karkhanis. Pass@k policy optimization: Solving harder reinforcement learning problems. arXiv preprint arXiv:2505.15201, 2025.
-
[43]
Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021, pages 1785–1797, 2021.
-
[44]
Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. ThetaEvolve: Test-time learning on open problems. arXiv preprint arXiv:2511.23473, 2025.
-
[45]
Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Shuo Chen, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, et al. PACEvolve: Enabling long-horizon progress-aware consistent evolution. arXiv preprint arXiv:2601.10657, 2026.
-
[46]
John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, and Ofir Press. ProgramBench: Can language models rebuild programs from scratch?, 2026.
-
[47]
Sherry Yang, Joy He-Yueya, and Percy Liang. Reinforcement learning for machine learning engineering agents. arXiv preprint arXiv:2509.01684, 2025.
-
[48]
Yufei Ye, Wei Guo, Hao Wang, Luankang Zhang, Heng Chang, Hong Zhu, Yuyang Ye, Yong Liu, Defu Lian, and Enhong Chen. Fuxi-Linear: Unleashing the power of linear attention in long-term time-aware sequential recommendation. arXiv preprint arXiv:2602.23671, 2026.
-
[49]
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
-
[50]
Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time. arXiv preprint arXiv:2601.16175, 2026.
-
[51]
Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.
-
[52]
Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Michael He, et al. Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations. arXiv preprint arXiv:2402.17152, 2024.
-
[53]
Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Daifeng Guo, Yanli Zhao, Shen Li, Yuchen Hao, Yantao Yao, et al. Wukong: Towards a scaling law for large-scale recommendation. arXiv preprint arXiv:2403.02545, 2024.
-
[54]
Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Ying Yang, Shijin Gong, and Chengchun Shi. Demystifying group relative policy optimization: Its policy gradient is a U-statistic. arXiv preprint arXiv:2603.01162, 2026.
-
[55]
Jieming Zhu, Quanyu Dai, Liangcai Su, Rong Ma, Jinyang Liu, Guohao Cai, Xi Xiao, and Rui Zhang. BARS: Towards open benchmarking for recommender systems. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2912–2923, 2022.
-
[56]
Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, and Xiuqiang He. Open benchmarking for click-through rate prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 2759–2769, 2021.
-
[57]
Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al. Where LLM agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370, 2025.
-
[58]
Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025.
A Limitations (excerpt from the paper)
Due to the high costs of both RL training and evolutionary search and limited resources, exacerbated by the fact that evaluating each evol...