pith. sign in

arxiv: 2607.02390 · v1 · pith:YI7EMO4Vnew · submitted 2026-07-02 · 💻 cs.LG

DecompRL: Solving Harder Problems by Learning Modular Code Generation

Pith reviewed 2026-07-03 16:28 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learningcode generationmodular decompositionlarge language modelshierarchical structurestest-time compute
0
0 comments X

The pith

DecompRL trains models to decompose code problems into modules so their implementations can be recombined into exponentially more solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When a language model's chance of generating a correct solution is near zero, neither repeated sampling nor standard reinforcement learning can help because both remain trapped in that low-probability region. DecompRL instead teaches the model to split a problem into smaller, independent sub-functions, produce multiple implementations for each, and then recombine those pieces. The recombination step turns n modules with k variants each into up to k to the n candidate programs that can be checked on cheap CPU hardware. This approach cuts GPU token usage by roughly fifty times and lets the same base models solve problems on LiveCodeBench and CodeContests that remain out of reach for ordinary generation methods once token budgets exceed one hundred thousand per problem.

Core claim

DecompRL is an RL algorithm that explicitly learns to decompose problems into hierarchical code structures and implement them as independent modules. Recombining k implementations of n modules produces up to k^n candidate solutions, moving the search bottleneck from expensive GPU sampling to inexpensive CPU evaluation and lowering token cost by about fifty times. On LiveCodeBench and CodeContests the method outperforms both standard and diversity-optimized RL baselines beyond 10^5 tokens per problem with Qwen 2.5 7B and Code World Model 32B.

What carries the argument

Decomposition of a problem into independently solvable sub-functions whose separate implementations are recombined into full solutions, with the decomposition policy itself learned by reinforcement learning.

If this is right

  • Recombination of k implementations across n modules yields up to k^n candidates at CPU cost.
  • GPU token cost drops by roughly fifty times compared with direct sampling.
  • Models reach correct solutions on problems where base-policy probability is near zero.
  • Performance gains appear on LiveCodeBench and CodeContests once token budgets exceed 10^5 per problem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition-plus-recombination pattern could be tested on domains such as mathematical proofs if they contain clear sub-problems.
  • Increasing the depth or number of modules might further enlarge the effective search space without extra GPU sampling.
  • Training explicitly for modularity may prove more sample-efficient than scaling test-time compute alone.

Load-bearing premise

Problems admit decompositions into sub-functions that can be solved and implemented independently and then recombined into a correct full solution.

What would settle it

A benchmark run in which modular recombination produces no additional correct solutions even after the RL stage has converged and the number of module variants is increased.

read the original abstract

How can Large Language Models (LLMs) solve problems they currently cannot? Repeated sampling scales test-time compute but GPU cost grows linearly with attempts, while reinforcement learning (RL) with verifiable rewards improves single-attempt accuracy at the expense of sample diversity. Both strategies ultimately fail when the base policy has near-zero probability of producing a correct solution: no amount of sampling or gradient signal can overcome a search space that is simply too large. We take a different approach: rather than sampling harder, we make the task easier by decomposing problems into smaller, independently solvable sub-functions whose implementations can be recombined. Since off-the-shelf models are not trained for this modular generation, we introduce DecompRL, an RL algorithm that explicitly learns to decompose and implement hierarchical code structures. Recombining $k$ implementations of $n$ modules yields up to $k^{n}$ candidate solutions, shifting the bottleneck from GPU inference to cheap CPU evaluation and cutting GPU token cost by $\sim$50$\times$. On LiveCodeBench and CodeContests (Qwen~2.5~7B, Code World Model~32B), DecompRL outperforms standard and diversity-optimized RL baselines beyond $10^5$ tokens per problem, solving problems that standard generation cannot reach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces DecompRL, an RL algorithm that trains LLMs to decompose coding problems into modular sub-functions whose separate implementations are recombined to yield up to k^n candidate solutions. This shifts the search bottleneck from GPU inference to CPU evaluation. On LiveCodeBench and CodeContests with Qwen 2.5 7B and Code World Model 32B, DecompRL outperforms standard and diversity-optimized RL baselines beyond 10^5 tokens per problem while solving instances unreachable by direct generation.

Significance. If the results hold, the work demonstrates a practical route to scaling beyond the limits of repeated sampling and standard RL by changing problem structure via learned modularity. Strengths include the use of verifiable rewards, explicit CPU-side enumeration, ablations, and scaling plots that directly support the token-cost and performance claims.

minor comments (3)
  1. [Abstract] Abstract: the statement that DecompRL 'solves problems that standard generation cannot reach' would be strengthened by a brief quantitative note on how many such problems were solved and the exact token threshold at which the crossover occurs.
  2. [§5] §5 (Experiments): the recombination mechanics and k^n enumeration are described clearly, but the text could add a short paragraph confirming that sub-function independence was verified post-hoc on the solved instances rather than assumed.
  3. [Figure 3] Figure 3 caption: the scaling curves compare methods at fixed token budgets, but the legend should explicitly note whether the DecompRL training cost is amortized or excluded from the per-problem token count.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the method's strengths (verifiable rewards, CPU-side enumeration, ablations, and scaling plots), and recommendation of minor revision. No major comments appear in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces DecompRL as an empirical RL method for learning modular decompositions in code generation, evaluated on external benchmarks like LiveCodeBench and CodeContests with reported outperformance and ablations. No equations, derivations, or first-principles claims are present that reduce predictions or results to fitted inputs by construction, self-definitional loops, or load-bearing self-citations. The central claims rest on verifiable reward signals, recombination mechanics, and scaling experiments that are externally falsifiable and do not rely on internal redefinitions or imported uniqueness theorems from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly relies on the existence of useful modular decompositions but does not introduce new entities or fit parameters in the provided text.

pith-pipeline@v0.9.1-grok · 5765 in / 1126 out tokens · 24163 ms · 2026-07-03T16:28:38.151212+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 36 canonical work pages · 12 internal anchors

  1. [1]

    The Llama 3 Herd of Models , 2024

    Llama Team AI @ Meta. The Llama 3 Herd of Models , 2024

  2. [2]

    Albrecht, Filippos Christianos, and Lukas Sch\"afer

    Stefano V. Albrecht, Filippos Christianos, and Lukas Sch\"afer. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. MIT Press, 2024. https://www.marl-book.com

  3. [3]

    Thinking fast and slow with deep learning and tree search

    Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. Advances in neural information processing systems, 30, 2017

  4. [4]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R \'e , and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

  5. [5]

    Divide-and-conquer meets consensus: Unleashing the power of functions in code generation, 2024

    Jingchang Chen, Hongxuan Tang, Zheng Chu, Qianglong Chen, Zekun Wang, Ming Liu, and Bing Qin. Divide-and-conquer meets consensus: Unleashing the power of functions in code generation, 2024. https://arxiv.org/abs/2405.20092

  6. [6]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  7. [7]

    X., and Shi, G

    Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025

  8. [8]

    Deep reinforcement learning from human preferences

    Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences, 2023. https://arxiv.org/abs/1706.03741

  9. [9]

    Soft policy optimization: Online off-policy rl for sequence models

    Taco Cohen, David W Zhang, Kunhao Zheng, Yunhao Tang, Remi Munos, and Gabriel Synnaeve. Soft policy optimization: Online off-policy rl for sequence models. arXiv preprint arXiv:2503.05453, 2025

  10. [10]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. https://arxiv.org/abs/2505.22617

  11. [11]

    Gregory, and Norman J

    Dominique de Caen, David A. Gregory, and Norman J. Pullman. The boolean rank of zero-one matrices, 1981

  12. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. https://arxiv.org/abs/2501.12948

  13. [14]

    Stp: Self-play llm theorem provers with iterative conjecturing and proving, 2025

    Kefan Dong and Tengyu Ma. Stp: Self-play llm theorem provers with iterative conjecturing and proving, 2025. https://arxiv.org/abs/2502.00212

  14. [15]

    Dreamcoder: growing generalizable, interpretable knowledge with wake--sleep bayesian program learning

    Kevin Ellis, Lionel Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lore Anaya Pozo, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: growing generalizable, interpretable knowledge with wake--sleep bayesian program learning. Philosophical Transactions of the Royal Society A, 381 0 (2251): 0 20220050, 2023

  15. [16]

    FAIR CodeGen team , :, Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, F...

  16. [17]

    Alphazero-like tree-search can guide large language model decoding and training, 2024

    Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024. https://arxiv.org/abs/2309.17179

  17. [18]

    Counterfactual multi-agent policy gradients, 2024

    Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients, 2024. https://arxiv.org/abs/1705.08926

  18. [19]

    Computers and intractability, volume 29

    Michael R Garey and David S Johnson. Computers and intractability, volume 29. wh freeman New York, 2002

  19. [20]

    Alien coding

    Thibault Gauthier, Miroslav Ol s \'a k, and Josef Urban. Alien coding. International Journal of Approximate Reasoning, 162: 0 109009, 2023

  20. [21]

    Rlef: Grounding code llms in execution feedback with reinforcement learning, 2025

    Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning, 2025. https://arxiv.org/abs/2410.02089

  21. [22]

    Symbolic regression with a learned concept library

    Arya Grayeli, Atharva Sehgal, Omar Costilla Reyes, Miles Cranmer, and Swarat Chaudhuri. Symbolic regression with a learned concept library. Advances in Neural Information Processing Systems, 37: 0 44678--44709, 2024

  22. [23]

    Reinforced Self-Training (ReST) for Language Modeling

    Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023

  23. [24]

    Peano: Learning formal mathematical reasoning, 2024

    Gabriel Haller, Talia Ringer, Jason Rute, and Brando Miranda. Peano: Learning formal mathematical reasoning, 2024. https://arxiv.org/abs/2405.06738

  24. [25]

    Language models can teach themselves to program better

    Patrick Haluptzok, Matthew Bowers, and Adam Tauman Kalai. Language models can teach themselves to program better. arXiv preprint arXiv:2207.14502, 2022

  25. [26]

    Openllm-rtl: Open dataset and benchmark for llm-aided design of digital circuits, 2024

    Chia-Tung Ho, Yikang Shen, Jingyu Pan, Chao Fang, Hao Liu, Tianyu Liu, and Zhiru Zhang. Openllm-rtl: Open dataset and benchmark for llm-aided design of digital circuits, 2024. https://arxiv.org/abs/2407.14326

  26. [27]

    Best-of-n jailbreaking, 2024

    John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. Best-of-n jailbreaking, 2024. https://arxiv.org/abs/2412.03556

  27. [28]

    Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez

    Md. Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. Mapcoder: Multi-agent code generation for competitive problem solving, 2024. https://arxiv.org/abs/2405.11403

  28. [29]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024 a

  29. [30]

    Gonzalez, Koushik Sen, and Ion Stoica

    Naman Jain, Tianjun Zhang, Wei - Lin Chiang, Joseph E. Gonzalez, Koushik Sen, and Ion Stoica. Llm-assisted code cleaning for training accurate code generators. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024 b . https://openreview.net/forum?id=maRYffiUpI

  30. [31]

    Decomposed prompting: A modular approach for solving complex tasks

    Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. https://openreview.net/forum?id=\_nGgzQjzaRy

  31. [32]

    Adam: A method for stochastic optimization

    Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015

  32. [33]

    Hypertree proof search for neural theorem proving

    Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, and Timothée Lacroix. Hypertree proof search for neural theorem proving. arXiv preprint arXiv:2205.11491, 2022. https://doi.org/10.48550/arXiv.2205.11491

  33. [34]

    Codechain: Towards modular code generation through chain of self-revisions with representative sub-modules

    Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, and Shafiq Joty. Codechain: Towards modular code generation through chain of self-revisions with representative sub-modules. arXiv preprint arXiv:2310.08992, 2023

  34. [35]

    Taco: Topics in algorithmic code generation dataset, 2023

    Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset, 2023. https://arxiv.org/abs/2312.14852

  35. [36]

    Competition-level code generation with alphacode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R \'e mi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378 0 (6624): 0 1092--1097, 2022

  36. [37]

    SFS : Smarter code space search improves LLM inference scaling

    Jonathan Light, Yue Wu, Yiyou Sun, Wenchao Yu, Yanchi Liu, Xujiang Zhao, Ziniu Hu, Haifeng Chen, and Wei Cheng. SFS : Smarter code space search improves LLM inference scaling. In The Thirteenth International Conference on Learning Representations, 2025. https://openreview.net/forum?id=MCHuGOkExF

  37. [38]

    Goedel-prover: A frontier model for open-source automated theorem proving, 2025

    Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover: A frontier model for open-source automated theorem proving, 2025. https://arxiv.org/abs/2502.07640

  38. [39]

    Decoupled weight decay regularization, 2019

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019

  39. [40]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36: 0 46534--46594, 2023

  40. [41]

    Playing Atari with Deep Reinforcement Learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013

  41. [42]

    Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025

    Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous rlhf: Faster and more efficient off-policy rl for language models, 2025. https://arxiv.org/abs/2410.18252

  42. [43]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  43. [44]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 0 27730--27744, 2022

  44. [45]

    Learning formal mathematics from intrinsic motivation

    Gabriel Poesia, David Broman, Nick Haber, and Noah Goodman. Learning formal mathematics from intrinsic motivation. Advances in Neural Information Processing Systems, 37: 0 43032--43057, 2024

  45. [46]

    Formal mathematics statement curriculum learning, 2022

    Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning, 2022

  46. [47]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  47. [48]

    Learn to reason efficiently with adaptive length-based reward shaping, 2025

    Srishti Rastogi, Yijia Shao, Rohan Padhye, and Diyi Yang. Learn to reason efficiently with adaptive length-based reward shaping, 2025. https://arxiv.org/abs/2504.01191

  48. [49]

    Rosipal and M

    R. Rosipal and M. Girolami. An expectation-maximization approach to nonlinear component analysis. Neural Computation, 13: 0 505--510, 2001

  49. [50]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. https://arxiv.org/abs/1707.06347

  50. [51]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. https://arxiv.org/abs/2402.03300

  51. [52]

    From code to correctness: Closing the last mile of code generation with hierarchical debugging

    Yuling Shi, Songsong Wang, Chengcheng Wan, and Xiaodong Gu. From code to correctness: Closing the last mile of code generation with hierarchical debugging. arXiv preprint arXiv:2410.01215, 2024

  52. [53]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36: 0 8634--8652, 2023

  53. [54]

    Sutton and A

    R. Sutton and A. Barto. Reinforcement learning: An introduction. MIT Press, 1998

  54. [55]

    Optimizing language models for inference time objectives using reinforcement learning

    Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and R \'e mi Munos. Optimizing language models for inference time objectives using reinforcement learning. arXiv preprint arXiv:2503.19595, 2025

  55. [56]

    Codeplay: Autotelic learning through collaborative self-play in programming environments

    Laetitia Teodorescu, C \'e dric Colas, Matthew Bowers, Thomas Carta, and Pierre-Yves Oudeyer. Codeplay: Autotelic learning through collaborative self-play in programming environments. In IMOL 2023-Intrinsically Motivated Open-ended Learning workshop at NeurIPS 2023, 2023

  56. [57]

    A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998

  57. [58]

    Hendryx, Summer Yue, and Hugh Zhang

    Evan Z Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, William Song, Vaskar Nath, Ziwen Han, Sean M. Hendryx, Summer Yue, and Hugh Zhang. Planning in natural language improves LLM search for code generation. In The Thirteenth International Conference on Learning Representations, 2025. https://openreview.net/forum?id=48WAZhwHHw

  58. [59]

    Williams

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8 0 (3–4): 0 229–256, May 1992. ISSN 0885-6125. doi:10.1007/BF00992696. https://doi.org/10.1007/BF00992696

  59. [60]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural In...

  60. [61]

    Goodman, and Yuhuai Tony Wu

    Eric Zelikman, Jesse Mu, Noah D. Goodman, and Yuhuai Tony Wu. Star: Self-taught reasoner bootstrapping reasoning with reasoning. 2022

  61. [62]

    Parsel: A (de-) compositional framework for algorithmic reasoning with language models

    Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D Goodman, and Nick Haber. Parsel: A (de-) compositional framework for algorithmic reasoning with language models. arXiv preprint arXiv:2212.10561, 2023

  62. [63]

    Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

    Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman. Quiet-star: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629, 2024

  63. [64]

    Rest-mcts*: Llm self-training via process reward guided tree search

    Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search. Advances in Neural Information Processing Systems, 37: 0 64735--64772, 2024

  64. [65]

    Le, and Ed H

    Denny Zhou, Nathanael Sch \" a rli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview....

  65. [66]

    Le, Ed H

    Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, and Huaixiu Steven Zheng. Self-discover: Large language models self-compose reasoning structures, 2024. https://arxiv.org/abs/2402.03620