
arxiv: 2605.28814 · v1 · pith:KSNSDPLVnew · submitted 2026-05-27 · 💻 cs.CL

Self-Improving Language Models with Bidirectional Evolutionary Search

Pith reviewed 2026-06-29 12:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords bidirectional evolutionary search · self-improving language models · evolutionary operators · goal decomposition · post-training · inference-time search · problem solving benchmarks · entropy shell

The pith

Bidirectional evolutionary search lets language models generate better candidates by recombining partial trajectories and decomposing tasks into checkable subgoals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Bidirectional Evolutionary Search (BES) to overcome two limits of existing search methods for self-improving language models: sparse verification signals and confinement of candidates to high-probability regions through autoregressive expansion. BES pairs forward evolution, which recombines partial trajectories to create new candidates, with backward search that recursively decomposes the original task into verifiable subgoals. Theory indicates that expansion-only methods stay inside a narrow entropy shell while evolutionary operators escape it, and that backward decomposition can exponentially cut the samples needed for a correct answer. Experiments report consistent gains on post-training tasks where other algorithms produce no improvement and superior average and best-case results on three open problem-solving benchmarks at inference time.
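As a reference point for the "entropy shell" wording, the classical asymptotic equipartition property gives the simplest version of the phenomenon. The statement below is for a toy i.i.d. token source $q$ over a finite alphabet, which is an assumption made here for illustration; the paper's own definition and theorem for autoregressive models are not reproduced on this page.

```latex
% Toy i.i.d. setting (an illustrative assumption, not the paper's theorem).
% X_1, ..., X_n are drawn i.i.d. from q over a finite alphabet, with entropy H(q).
\[
  -\frac{1}{n}\log q(X_1,\dots,X_n)
  \;=\; -\frac{1}{n}\sum_{i=1}^{n}\log q(X_i)
  \;\xrightarrow{\;p\;}\; H(q)
  \quad\text{as } n\to\infty .
\]
% Hence, for any \epsilon > 0, samples land in the typical set (the "shell"):
\[
  A_\epsilon^{(n)}
  = \Bigl\{ x_{1:n} : \Bigl| -\tfrac{1}{n}\log q(x_{1:n}) - H(q) \Bigr| \le \epsilon \Bigr\},
  \qquad
  \Pr\bigl[X_{1:n}\in A_\epsilon^{(n)}\bigr] \to 1 .
\]
```

In this toy picture, sequences sampled directly from the source have per-token log-probability close to $-H(q)$. A candidate spliced together from pieces of different samples is not drawn from the source and need not satisfy that constraint, which is one way to read the summary's claim that evolutionary operators can leave the shell.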

Core claim

BES couples forward candidate evolution with backward goal decomposition. In the forward direction, standard expansion is augmented by evolution operators that recombine partial trajectories to generate candidates difficult to obtain from a single model rollout. In the backward direction, the original task is recursively decomposed into checkable subgoals that supply dense intermediate feedback to guide the forward search. This setup is motivated by the claim that expansion-only search remains confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer.

What carries the argument

Bidirectional Evolutionary Search (BES), which augments forward expansion with recombination operators on partial trajectories and pairs it with recursive backward decomposition into checkable subgoals that supply dense verification signals.
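The recombination operators can be pictured on trajectories stored as lists of steps. The sketch below follows the wording of the Figure 2 caption for combination, deletion, and translocation; it is not the authors' implementation, and the names `Step`, `Trajectory`, and all function names are hypothetical. Crossover is left out because its description is truncated in the caption.

```python
from typing import List, Sequence

Step = str               # hypothetical: one reasoning step, stored as text
Trajectory = List[Step]  # hypothetical: a partial trajectory


def common_prefix_len(a: Sequence[Step], b: Sequence[Step]) -> int:
    """Length of the longest shared prefix of two trajectories."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def combination(a: Trajectory, b: Trajectory) -> Trajectory:
    """Keep the shared prefix once, then concatenate both distinct suffixes."""
    k = common_prefix_len(a, b)
    return list(a[:k]) + list(a[k:]) + list(b[k:])


def deletion(a: Trajectory, index: int) -> Trajectory:
    """Remove one interior step (neither the first nor the last)."""
    if not 0 < index < len(a) - 1:
        raise ValueError("index must point at an interior step")
    return list(a[:index]) + list(a[index + 1:])


def translocation(a: Trajectory, i: int, b: Trajectory, j: int) -> Trajectory:
    """Replace step i of path A with step j of path B."""
    out = list(a)
    out[i] = b[j]
    return out


path_a = ["P1", "P2", "A3", "A4"]
path_b = ["P1", "P2", "B3"]
print(combination(path_a, path_b))          # ['P1', 'P2', 'A3', 'A4', 'B3']
print(deletion(path_a, 1))                  # ['P1', 'A3', 'A4']
print(translocation(path_a, 2, path_b, 2))  # ['P1', 'P2', 'B3', 'A4']
```

None of the three outputs can be produced by extending `path_a` or `path_b` one step at a time from its own prefix, which is the sense in which such candidates are hard to reach by expansion alone.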

If this is right

  • BES produces consistent gains on challenging post-training tasks where mainstream post-training algorithms fail to improve.
  • On three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance.
  • Evolutionary operators generate candidates that lie outside regions with substantial model probability mass.
  • Backward search exponentially reduces the number of samples required to reach a correct answer.
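A back-of-the-envelope model shows where an exponential gap of this kind can come from. Assume, purely for illustration, that a task splits into `k` subgoals and that each is solved independently with probability `p` per attempt. Without intermediate checks, a rollout only counts if all `k` steps are right at once; with per-subgoal verification, each subgoal can be retried on its own. This toy model and both function names are assumptions, not the paper's theorem.

```python
def expected_attempts_end_to_end(p: float, k: int) -> float:
    """Expected full rollouts until one gets all k steps right (geometric, rate p**k)."""
    return 1.0 / (p ** k)


def expected_attempts_with_subgoals(p: float, k: int) -> float:
    """Expected sub-attempts when each subgoal is verified and retried separately."""
    return k / p


for k in (1, 5, 10):
    print(k,
          expected_attempts_end_to_end(0.5, k),
          expected_attempts_with_subgoals(0.5, k))
# 1 2.0 2.0
# 5 32.0 10.0
# 10 1024.0 20.0
```

The first count grows as `p**-k` and the second only linearly in `k`. The comparison is generous to the end-to-end case in one respect (a sub-attempt is shorter than a full rollout) and generous to decomposition in another (it assumes perfect, noise-free subgoal checks).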

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same recombination and decomposition steps could be tested on agentic tasks that involve external tools or multi-step planning beyond the paper's language-model setting.
  • If subgoal verification is noisy or incomplete, the density advantage of backward search would shrink and overall sample efficiency would drop.
  • The entropy-shell argument suggests that similar recombination operators might help in other generative domains where autoregressive sampling is the default.

Load-bearing premise

The proposed evolution operators can reliably recombine partial trajectories into useful new candidates and the model can produce checkable subgoals whose verification signals meaningfully guide the forward search.
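One simple way subgoal checks could turn into a dense signal is to score each candidate by the fraction of checks it passes. The page does not say how BES aggregates verification signals, so the aggregation rule below, along with `Checker`, `dense_score`, and `rank_candidates`, is a hypothetical stand-in.

```python
from typing import Callable, List, Sequence

# Hypothetical: a checker verifies one subgoal against a candidate's text.
Checker = Callable[[str], bool]


def dense_score(candidate: str, checkers: Sequence[Checker]) -> float:
    """Fraction of subgoal checks the candidate passes (0.0 if there are none)."""
    if not checkers:
        return 0.0
    passed = sum(1 for check in checkers if check(candidate))
    return passed / len(checkers)


def rank_candidates(candidates: Sequence[str],
                    checkers: Sequence[Checker]) -> List[str]:
    """Order candidates from most to fewest subgoals satisfied."""
    return sorted(candidates, key=lambda c: dense_score(c, checkers), reverse=True)


# Toy subgoals: each one just looks for a marker string in the candidate.
checkers = [lambda c: "step1" in c, lambda c: "step2" in c]
candidates = ["step1 only", "step1 then step2", "neither"]

print([dense_score(c, checkers) for c in candidates])  # [0.5, 1.0, 0.0]
print(rank_candidates(candidates, checkers))
# ['step1 then step2', 'step1 only', 'neither']
```

A final-answer-only verifier would give "step1 only" and "neither" the same score of zero; the per-subgoal score separates them, which is what lets partial progress steer the forward search. If the checkers are noisy, this ranking degrades accordingly.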

What would settle it

An experiment on the post-training tasks or the three benchmarks in which BES produces no gains over standard methods, or in which recombined candidates remain inside the same probability-mass region as autoregressive rollouts.

Figures

Figures reproduced from arXiv: 2605.28814 by Guowei Xu, Himabindu Lakkaraju, Huangyuan Su, Sham M. Kakade, Weirui Ye, Yilun Du, Zhenting Qi.

Figure 1: Comparison of tree search and Bidirectional Evolutionary Search (…
Figure 2: Forward search operators. (a) Expansion: the policy generates new steps (yellow). (b) Combination: two trajectories sharing a common prefix (P1–P2) have their distinct suffixes concatenated into a single candidate. (c) Deletion: an interior step (P2) is removed. (d) Translocation: one step in Path A (A2) is replaced by a step from Path B (B2). (e) Crossover: Path A is cut at a splice point and its tail is…
Figure 3: EMA-smoothed validation accuracy
Figure 4: Ablation study on logical reasoning. We conduct an ablation study on the Knights-and-Knaves benchmark. On this benchmark, BES combines bidirectional evolutionary search for discovering high-quality samples with MaxRL's answer reweighting for training. We therefore consider two ablations: (1) removing answer reweighting; and (2) removing the evolution operators to verify that both bidirectional search and…
Figure 5: Case study of BES on a multi-hop reasoning problem. The forward search (top) explores two branches via expansion, both of which lead to wrong answers. A translocation operator then combines a reasoning step from the right branch into the left branch, producing a correct answer. The backward search (bottom) decomposes the original question into two sub-goals and provides dense verification feedback (green/r…
Original abstract

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable subgoals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES enables consistent gains, and on three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at https://github.com/Embodied-Minds-Lab/BES.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes Bidirectional Evolutionary Search (BES), a search framework for self-improving language models that couples forward candidate evolution (using recombination operators on partial trajectories) with backward goal decomposition (producing checkable subgoals for dense feedback). It motivates the approach theoretically by arguing that expansion-only search is confined to a narrow entropy shell while evolutionary operators allow escape and that backward decomposition can exponentially reduce required samples. Experiments are reported to show consistent gains on challenging post-training tasks where mainstream algorithms fail to improve, and superior average and best-case performance versus existing open-source frameworks on three open problem-solving benchmarks at inference time. Code and trained models are released.

Significance. If the empirical claims hold under matched compute and verification conditions, the work could meaningfully advance search-based self-improvement methods by addressing sparse signals and limited exploration in current approaches. The combination of theoretical motivation for entropy-shell escape and sample-complexity reduction, together with the public release of code and models, strengthens potential impact for both post-training and inference-time agentic systems.

minor comments (2)
  1. [Abstract] The abstract states high-level experimental outcomes but does not report effect sizes, exact baselines, or control conditions; adding one or two quantitative sentences would improve clarity without altering the central claims.
  2. The description of the evolution operators and subgoal verification would benefit from an explicit statement of how partial-trajectory recombination is implemented and how verification signals are aggregated in the forward search.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive summary and for recommending minor revision. The assessment correctly captures the core contributions of Bidirectional Evolutionary Search, including the coupling of forward evolutionary recombination with backward subgoal decomposition, the theoretical arguments on entropy-shell escape and sample-complexity reduction, and the empirical results on post-training and inference-time tasks. We are pleased that the public release of code and models is viewed as strengthening potential impact.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The provided abstract and context contain no equations, fitted parameters, or self-citations that reduce any claimed prediction or theoretical result to its inputs by construction. Theoretical motivations (entropy shell confinement, exponential sample reduction) are presented as independent derivations rather than re-expressions of prior fitted quantities or author-specific uniqueness theorems. Empirical claims rest on proposed operators and benchmarks without evidence of self-definitional loops or renaming of known results. This is the expected self-contained case.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about language-model behavior rather than new free parameters or invented entities.

axioms (1)
  • domain assumption: Language models can generate partial trajectories that are recombinable into useful new candidates and can produce checkable subgoals.
    Invoked to justify both the forward evolution operators and the backward decomposition step.

pith-pipeline@v0.9.1-grok · 5799 in / 1323 out tokens · 44503 ms · 2026-06-29T12:54:50.051436+00:00 · methodology

