Inducing Reasoning Primitives from Agent Traces

Jiarui Yan; Joshua Momo; William W. Cohen; Zhihan Lei

arxiv: 2606.02994 · v1 · pith:5HR2OWWAnew · submitted 2026-06-02 · 💻 cs.AI · cs.CL

Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen

Pith reviewed 2026-06-28 10:52 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords reasoning primitives, ReAct agents, trace mining, pseudo-tools, agent libraries, induction from traces, LLM composition, planning benchmarks

The pith

Libraries of reasoning primitives induced from agent traces let LLMs outperform the agents that generated those traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Reasoning Primitive Induction, a method that mines successful ReAct traces from LLM agents, clusters recurrent reasoning moves, and converts the most frequent ones into a compact library of pseudo-tools. Each pseudo-tool is defined only by a natural-language docstring that an LLM interprets when the tool is invoked, after which a standard ReAct loop composes the primitives to solve new problems. On the three main tasks the induced libraries raise accuracy from 30 to 74 percent (RuleArena NBA), 38 to 68 percent (MuSR team allocation), and 7 to 29 percent (NatPlan meeting planning). The same fixed configuration also exceeds zero-shot Chain-of-Thought on every subtask examined, matches or beats expert-written decompositions, and outperforms AWM at lower average inference cost. The central claim is therefore that recurrent reasoning patterns visible in successful traces can be extracted once and reused to improve performance on the same class of tasks.

Core claim

Reasoning Primitive Induction mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The induced libraries outperform the very agent that generated their traces by 44, 30, and 22 percentage points on the three main tasks. They also improve over zero-shot Chain-of-Thought on every subtask, match or surpass expert-authored decompositions, and outperform AWM at lower average inference cost.

What carries the argument

Reasoning Primitive Induction, which clusters recurrent reasoning moves from successful traces and converts them into pseudo-tools specified by natural-language docstrings.
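
The four stages named in the Figure 2 caption (filter for correctness, extract Thought strings, categorize and merge, synthesize the top K) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Rollout`, `induce_library`, and `keyword_labeler` are hypothetical names, and the keyword-based labeler is a toy stand-in for the LLM categorization calls the paper describes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    """One ReAct rollout (hypothetical container, not from the paper)."""
    thoughts: list[str]  # per-step Thought strings
    correct: bool        # whether the final answer was correct


def induce_library(
    rollouts: list[Rollout],
    label_move: Callable[[str], str],
    k: int,
) -> list[tuple[str, int]]:
    """Return the k most frequent reasoning-move labels with their counts."""
    # Stage 1: keep only successful rollouts.
    successful = [r for r in rollouts if r.correct]
    # Stage 2: extract the per-step Thought strings.
    thoughts = [t for r in successful for t in r.thoughts]
    # Stage 3: map each thought to a canonical move label.
    # (Assumption: the paper uses LLM calls here; we use a stub.)
    counts = Counter(label_move(t) for t in thoughts)
    # Stage 4: keep the top-k moves; each would then be written up
    # as a pseudo-tool docstring.
    return counts.most_common(k)


def keyword_labeler(thought: str) -> str:
    """Toy stand-in for the LLM categorizer: label by keyword."""
    text = thought.lower()
    if "alibi" in text:
        return "check_alibi"
    if "motive" in text:
        return "assess_motive"
    return "other"


rollouts = [
    Rollout(["Check the alibi of suspect A.", "Assess motive for A."], correct=True),
    Rollout(["Check alibi for B."], correct=True),
    Rollout(["Assess motive for C."], correct=False),  # dropped in stage 1
]
library = induce_library(rollouts, keyword_labeler, k=2)
print(library)
```

In this toy run the failed rollout contributes nothing, so the ranking reflects only moves that appeared in successful traces.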

If this is right

Induced libraries raise accuracy from 30 to 74 percent on RuleArena NBA, 38 to 68 percent on MuSR team allocation, and 7 to 29 percent on NatPlan meeting planning.
A single fixed configuration improves over zero-shot Chain-of-Thought on every one of five subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning.
The libraries match or surpass expert-authored decompositions and outperform AWM at lower average inference cost.
The method works across narrative deduction, rule application, and planning tasks without task-specific tuning.
Pseudo-tools remain usable by the same LLM that produced the original traces.
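
As a quick arithmetic check, the percentage-point gains quoted above follow directly from the before and after accuracies. The snippet below only restates numbers reported in the abstract; the dictionary layout and variable names are illustrative.

```python
# (source agent accuracy %, induced library accuracy %) as reported in the abstract
results = {
    "RuleArena NBA": (30, 74),
    "MuSR team allocation": (38, 68),
    "NatPlan meeting planning": (7, 29),
}

# Absolute gain in percentage points (pp), not relative percent change.
gains = {task: after - before for task, (before, after) in results.items()}

for task, gain in gains.items():
    print(f"{task}: +{gain}pp")
```

These are absolute differences in accuracy, which is why a move from 7 to 29 percent counts as +22pp even though it more than quadruples the baseline.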

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Repeated cycles of trace collection and library induction could produce progressively stronger agent libraries without external supervision.
The same clustering approach might expose reusable reasoning patterns in domains beyond the three benchmarks tested here.
Because the primitives are stored only as docstrings, they could be inspected or edited by humans to inject domain knowledge.
The cost reduction relative to AWM suggests the approach may scale to larger trace collections without proportional growth in inference expense.

Load-bearing premise

Recurrent reasoning moves identified by clustering on successful traces will form coherent, generalizable pseudo-tools whose natural-language docstrings can be reliably interpreted and composed by the LLM at test time.
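
To make the premise concrete, here is one way a docstring-only pseudo-tool could be represented and invoked. This is a sketch under assumptions: the `PseudoTool` class, the prompt layout, and `echo_llm` are all hypothetical, and the paper's actual prompt format and type signatures are not given in the abstract.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PseudoTool:
    """A primitive with a name and a natural-language spec, but no code body."""
    name: str
    docstring: str

    def build_prompt(self, **kwargs: str) -> str:
        # Assumed prompt layout; the model is asked to "execute" the docstring.
        args = "\n".join(f"{key}: {value}" for key, value in kwargs.items())
        return f"Tool: {self.name}\nSpec: {self.docstring}\nInputs:\n{args}\nOutput:"

    def invoke(self, llm: Callable[[str], str], **kwargs: str) -> str:
        # The tool body is realized by the LLM at call time.
        return llm(self.build_prompt(**kwargs))


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call; reports how many prompt lines it saw."""
    return f"[model output for {len(prompt.splitlines())} prompt lines]"


tool = PseudoTool(
    name="check_alibi",
    docstring="Decide whether the suspect's stated alibi is consistent with the timeline.",
)
result = tool.invoke(echo_llm, suspect="A", timeline="9pm at the theater")
print(result)
```

Because the behavior lives entirely in the docstring, any ambiguity in that text is resolved by the model at invocation time, which is exactly where the premise could fail.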

What would settle it

Applying the induced library to the same tasks and observing no accuracy gain or a drop relative to the original un-augmented agent on the RuleArena, MuSR, or NatPlan benchmarks.

Figures

Figures reproduced from arXiv: 2606.02994 by Jiarui Yan, Joshua Momo, William W. Cohen, Zhihan Lei.

**Figure 1.** Worked example of reasoning primitive induction on MuSR murder. (A) Three verbatim Thought strings from one successful ReAct rollout. (B) Each thought is mapped to a reasoning-move label (Algorithm 1); the bar chart shows the top-5 canonical categories. (C) The top category is synthesized into a typed pseudo-tool whose body is realized by an LLM at invocation time (full library in Appendix A). Induction ex…

**Figure 2.** Reasoning primitive induction. The pipeline (Algorithm 1) consumes a corpus of ReAct rollouts and emits a library L in four stages: filter for correctness, extract per-step Thought strings, categorize and merge them into canonical reasoning moves via two LLM calls, and synthesize the top K as typed primitives. The recipe has two free parameters (K, m), three LLM prompts, and one configuration unchanged acr…

**Figure 3.** Exceeding the source agent. Test-set accuracy of the source ReAct agent versus the induced library. Induced libraries exceed their source by up to +44pp on RuleArena NBA, +30pp on MuSR team allocation, and +22pp on NatPlan meeting (all paired-∆ CIs strictly positive; Appendix G). NatPlan trip is excluded from method comparisons because all methods score below 10%.

**Figure 4.** Discovery matches or surpasses expert design. Test-set accuracy of expert-authored decompositions versus the induced library. Induced libraries match or exceed the expert spec on every comparable cell; the team (+17pp) and meeting (+15pp) gains are statistically significant, while murder, object, and NBA are within 95% bootstrap CIs (Appendix G). Bold deltas indicate significance.

Original abstract

ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The trace-mining approach to building reusable pseudo-tools looks workable and reports strong gains, but the clustering-to-docstring step needs close checking for robustness.

The main takeaway is that mining successful ReAct traces, clustering repeated moves, and turning the clusters into natural-language pseudo-tools can produce a library that the same LLM then uses to solve new problems better than the original agent. The reported jumps (30 to 74 on RuleArena NBA, 38 to 68 on MuSR team allocation, 7 to 29 on NatPlan meeting planning) are large enough to notice.

What the paper actually does is run a single-pass extraction on traces, group similar reasoning steps, attach docstrings, and feed the resulting typed tools back into a standard ReAct loop at test time. This is distinct from prior ReAct work or simple few-shot prompting because the primitives are derived directly from the agent's own behavior rather than hand-written. The consistent improvement over zero-shot CoT and the edge over expert decompositions at lower cost are the concrete results.

The soft spot is exactly the one the stress-test flags: clustering on trace similarity does not automatically guarantee that the resulting docstrings are coherent or context-independent enough for reliable composition on held-out cases. The abstract gives no detail on the clustering method, exclusion rules, or any validation of the docstrings, so it is possible the gains rest on favorable choices in how the frequent moves were turned into specs. If the full paper shows ablations or error analysis on that transfer step, the claim strengthens; without it the generalization remains an assumption.

This is aimed at people building or evaluating LLM agents for planning and deduction tasks. A reader who wants a practical pipeline for turning transient traces into reusable components will find the setup useful to try. The work deserves a serious referee because the method is simple to reproduce and the empirical pattern, if it holds under scrutiny, is worth testing in other agent settings.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, converts the most frequent into a compact library of typed pseudo-tools specified by natural-language docstrings, and composes them via a standard ReAct loop at test time. The central empirical claim is that the induced libraries outperform the trace-generating agent itself (+44pp on RuleArena NBA from 30 to 74, +30pp on MuSR team allocation from 38 to 68, +22pp on NatPlan meeting planning from 7 to 29) while also improving over zero-shot CoT on all five subtasks, matching or surpassing expert-authored decompositions, and outperforming AWM at lower average inference cost.

Significance. If the performance claims and transfer from mined primitives to held-out composition hold under verification, the work would demonstrate a practical route to automatically extracting reusable reasoning components from agent behavior. The single fixed configuration across narrative, rule, and planning domains and the reported cost advantage would be notable strengths for agent design.

major comments (3)

[§3] §3 (Method): The conversion of clustered trace moves into natural-language docstrings includes no validation, human review, or execution-grounded refinement step; this is load-bearing for the claim that the primitives are reliably interpretable and composable by the base LLM on held-out problems, as clustering on surface similarity alone does not ensure semantic coherence or context-independence.
[§4] §4 (Experiments) and abstract: The headline gains (+44pp, +30pp, +22pp) and cross-subtask improvements are reported without details on the clustering algorithm, exclusion rules for traces, number of clusters selected, or any ablation on docstring quality; without these, it is unclear whether the results rest on post-hoc choices rather than the single-pass procedure.
[Table 1] Table 1 or equivalent results table: No error bars, statistical significance tests, or failure-case analysis are referenced for the five subtasks, which undermines the claim that a single configuration reliably outperforms baselines and the original agent across domains.

minor comments (1)

[Abstract] Abstract: The description of the clustering step could be expanded with one sentence on the similarity metric or algorithm used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We respond to each major comment below.

Referee: [§3] §3 (Method): The conversion of clustered trace moves into natural-language docstrings includes no validation, human review, or execution-grounded refinement step; this is load-bearing for the claim that the primitives are reliably interpretable and composable by the base LLM on held-out problems, as clustering on surface similarity alone does not ensure semantic coherence or context-independence.

Authors: The method is designed as a single-pass automated procedure without human intervention, relying on the empirical success of the induced primitives in the ReAct composition on held-out tasks to demonstrate their interpretability. Clustering is performed on recurrent moves from successful traces, which by construction are context-independent enough to be reused. We will add examples of the generated docstrings and their application in the revised §3 to illustrate semantic coherence. revision: partial
Referee: [§4] §4 (Experiments) and abstract: The headline gains (+44pp, +30pp, +22pp) and cross-subtask improvements are reported without details on the clustering algorithm, exclusion rules for traces, number of clusters selected, or any ablation on docstring quality; without these, it is unclear whether the results rest on post-hoc choices rather than the single-pass procedure.

Authors: We will expand §3 and §4 to provide full details on the clustering algorithm, trace selection criteria, and cluster count selection process. We will also include an ablation comparing the full method to variants with different docstring generation approaches to show robustness. The configuration is fixed across all experiments as stated, and these details will confirm it is not post-hoc. revision: yes
Referee: [Table 1] Table 1 or equivalent results table: No error bars, statistical significance tests, or failure-case analysis are referenced for the five subtasks, which undermines the claim that a single configuration reliably outperforms baselines and the original agent across domains.

Authors: We agree that additional statistical rigor would strengthen the presentation. In the revision, we will add error bars based on multiple runs, perform significance tests for the reported improvements, and include a short analysis of failure cases to support the reliability of the single configuration. revision: yes
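
The kind of significance check discussed in this exchange can be illustrated with a paired percentile bootstrap over per-item correctness. This is a generic sketch, not the procedure in the paper's Appendix G: the function name, the resample count, and the per-item data are assumptions, and the 0/1 vectors below are synthetic, constructed only so that their means equal the reported 30 and 74 percent.

```python
import random


def paired_bootstrap_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treated - baseline) over paired items."""
    if len(baseline) != len(treated):
        raise ValueError("paired samples must have equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# SYNTHETIC per-item outcomes (1 = correct), not data from the paper.
baseline = [1] * 30 + [0] * 70  # source agent: 30% accuracy
treated = [1] * 74 + [0] * 26   # induced library: 74% accuracy

delta, lower, upper = paired_bootstrap_ci(baseline, treated)
print(f"delta = {delta:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

A gain would be called reliable under this check when the whole interval lies above zero; with real data the pairing must come from the same test items evaluated under both methods.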

Circularity Check

0 steps flagged

No significant circularity; empirical extraction on held-out tasks

The paper describes a single-pass mining procedure that clusters recurrent moves from successful ReAct traces, converts them to natural-language docstring pseudo-tools, and evaluates composition on held-out problems. No equations, fitted parameters, or self-citations are presented that would make the reported gains (+44pp, +30pp, +22pp) equivalent to the input traces by construction. The method contains no self-definitional loops, no renaming of known results as novel derivations, and no load-bearing uniqueness theorems imported from the authors' prior work. The central result is an empirical comparison against the trace-generating agent and baselines, which is externally falsifiable on the stated benchmarks and does not reduce to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the approach relies on standard LLM prompting and clustering assumptions that are not enumerated.

Reference graph

Works this paper leans on

42 extracted references · 7 canonical work pages · 2 internal anchors

[1] C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In ICLR, 2025.
[2] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In ICLR, 2023.
[3] Z. Z. Wang, J. Mao, D. Fried, and G. Neubig. Agent workflow memory. In ICML, 2025.
[4] Z. Z. Wang, A. Gandhi, G. Neubig, and D. Fried. Inducing programmatic skills for agentic tasks. In COLM, 2025.
[5] J. Zhang, J. Xiang, Z. Yu, F. Teng, X.-H. Chen, J. Chen, M. Zhuge, X. Cheng, S. Hong, J. Wang, B. Zheng, B. Liu, Y. Luo, and C. Wu. AFlow: Automating agentic workflow generation. In ICLR (Oral), 2025.
[6] Z. Shi, Y. Zhu, J. Shi, X. Zhang, L. Wang, and C. Miao. STIR: Internalizing LLM reasoning via discovery and replay of latent actions. arXiv preprint arXiv:2602.04925, 2026.
[7] G. Sarch, L. Jang, M. J. Tarr, W. W. Cohen, K. Marino, and K. Fragkiadaki. VLM agents generate their own memories: Distilling experience into embodied programs of thought. In NeurIPS, 2024.
[8] P. Zhou, J. Pujara, X. Ren, X. Chen, H.-T. Cheng, Q. V. Le, E. H. Chi, D. Zhou, S. Mishra, and H. S. Zheng. Self-discover: Large language models self-compose reasoning structures. In NeurIPS, 2024.
[9] O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In ICLR, 2024.
[10] L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, C. Potts, K. Sen, A. G. Dimakis, I. Stoica, D. Klein, M. Zaharia, and O. Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning. In ICLR (Oral), 2026.
[11] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In ICLR, 2023.
[12] A. Kumar, V. Zhuang, R. Agarwal, Y. Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs, L. M. Zhang, K. McKinney, D. Shrivastava, C. Paduraru, G. Tucker, D. Precup, F. Behbahani, and A. Faust. Training language models to self-correct via reinforcement learning. In ICLR, 2025.
[13] X. Wang, J. Wei, D. Schuurmans, Q. Le, S. Narang, A. Chowdhery, D. Zhou, and E. H. Chi. Self-consistency improves chain of thought reasoning in language models. In ICLR, 2023.
[14] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
[15] W. Chen, X. Ma, X. Wang, and W. W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. TMLR, 2023.
[16] A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In ICML, 2024.
[17] Z. Sprague, X. Ye, K. Bostrom, S. Chaudhuri, and G. Durrett. MuSR: Testing the limits of chain-of-thought with multistep soft reasoning. In ICLR, 2024.
[18] C. A. Cohen and W. W. Cohen. Watch your steps: Observable and modular chains of thought. arXiv preprint arXiv:2409.15359, 2024.
[19] J. Leng, C. A. Cohen, Z. Zhang, C. Xiong, and W. W. Cohen. Semi-structured LLM reasoners can be rigorously audited. arXiv preprint arXiv:2505.24217, 2025.
[20] H. S. Zheng, S. Mishra, H. Zhang, X. Chen, M. Chen, A. Nova, L. Hou, H.-T. Cheng, Q. V. Le, E. H. Chi, and D. Zhou. NATURAL PLAN: Benchmarking LLMs on natural language planning. arXiv preprint arXiv:2406.04520, 2024.
[21] K. Ellis, C. Wong, M. Nye, M. Sablé-Meyer, L. Morales, L. Hewitt, L. Cary, A. Solar-Lezama, and J. B. Tenenbaum. DreamCoder: Bootstrapping inductive program synthesis with wake-sleep library learning. In PLDI, 2021.
[22] G. Grand, L. Wong, M. Bowers, T. X. Olausson, M. Liu, J. B. Tenenbaum, and J. Andreas. LILO: Learning interpretable libraries by compressing and documenting code. In ICLR, 2024.
[23] M. Bowers, T. X. Olausson, L. Wong, G. Grand, J. B. Tenenbaum, K. Ellis, and A. Solar-Lezama. Top-down synthesis for library learning. In POPL, 2023.
[24] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
[25] N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023.
[26] A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, and G. Huang. ExpeL: LLM agents are experiential learners. In AAAI, 2024.
[27] E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman. STaR: Bootstrapping reasoning with reasoning. In NeurIPS, 2022.
[28] Y. Qu, A. Singh, Y. Lee, A. Setlur, R. Salakhutdinov, C. Finn, and A. Kumar. RLAD: Training LLMs to discover abstractions for solving reasoning problems. arXiv preprint arXiv:2510.02263, 2025.
[29] A. Hosseini, X. Yuan, N. Malkin, A. Courville, A. Sordoni, and R. Agarwal. V-STaR: Training verifiers for self-taught reasoners. In COLM, 2024.
[30] L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K.-W. Lee, and E.-P. Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In ACL, 2023.
[31] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023.
[32] T. Cai, X. Wang, T. Ma, X. Chen, and D. Zhou. Large language models as tool makers. In ICLR, 2024.
[33] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. PAL: Program-aided language models. In ICML, 2023.
[34] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. In IEEE ICRA, 2023.
[35] Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao, E. Wong, M. Apidianaki, and C. Callison-Burch. Faithful chain-of-thought reasoning. In IJCNLP-AACL, 2023.
[36] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark. Self-refine: Iterative refinement with self-feedback. In NeurIPS, 2023.
[37] K. Opsahl-Ong, M. J. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In EMNLP, 2024.
[38] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. In NeurIPS, 2023.
[39] Z. Wang, D. Fried, and G. Neubig. TroVE: Inducing verifiable and efficient toolboxes for solving programmatic tasks. In ICML, 2024.
[40] L. Yuan, Y. Chen, X. Wang, Y. R. Fung, H. Peng, and H. Ji. CRAFT: Customizing LLMs by creating and retrieving from specialized toolsets. In ICLR, 2024.
[41] B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srinivasa, G. Liu, G. Neubig, and Y. Su. SkillWeaver: Web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079, 2025.
[42] R. Zhou, W. Hua, L. Pan, S. Cheng, X. Wu, E. Yu, and W. Y. Wang. RuleArena: A benchmark for rule-guided reasoning with LLMs in real-world scenarios. In ACL, 2025.
