Inducing Reasoning Primitives from Agent Traces
Pith reviewed 2026-06-28 10:52 UTC · model grok-4.3
The pith
Libraries of reasoning primitives induced from agent traces let LLMs outperform the agents that generated those traces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reasoning Primitive Induction mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The induced libraries outperform the very agent that generated their traces by 44, 30, and 22 percentage points on the three main tasks. A single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.
What carries the argument
Reasoning Primitive Induction, which clusters recurrent reasoning moves from successful traces and converts them into pseudo-tools specified by natural-language docstrings.
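The mine, cluster, and select steps can be pictured with a minimal sketch. The source does not name the similarity metric or clustering algorithm (the referee report below flags this), so everything here is an illustrative assumption rather than the authors' method: the `Move` schema, the token-overlap `jaccard` similarity, the greedy `cluster_moves` routine, the 0.5 threshold, and the `top_k` cutoff.

```python
from dataclasses import dataclass


@dataclass
class Move:
    """One reasoning step pulled from a ReAct trace (hypothetical schema)."""
    text: str
    trace_succeeded: bool


def tokens(text: str) -> frozenset:
    # Crude lexical representation; a real system might use embeddings instead.
    return frozenset(text.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_moves(moves, threshold=0.5):
    """Greedy single-pass clustering (an assumed stand-in, not the paper's algorithm).

    Each move joins the first cluster whose representative is similar enough,
    otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative_tokens, member_texts)
    for move in moves:
        if not move.trace_succeeded:
            continue  # only successful traces are mined
        toks = tokens(move.text)
        for rep, members in clusters:
            if jaccard(rep, toks) >= threshold:
                members.append(move.text)
                break
        else:
            clusters.append((toks, [move.text]))
    return [members for _, members in clusters]


def induce_library(moves, top_k=2, threshold=0.5):
    """Keep the top_k largest clusters as candidate primitives."""
    clusters = cluster_moves(moves, threshold)
    clusters.sort(key=len, reverse=True)  # most frequent moves first
    return [{"example": c[0], "support": len(c)} for c in clusters[:top_k]]


if __name__ == "__main__":
    demo = [
        Move("check salary cap rule for team", True),
        Move("check salary cap rule for player", True),
        Move("list all people in story", True),
        Move("check salary cap rule for trade", False),  # failed trace, ignored
    ]
    for entry in induce_library(demo):
        print(entry)
```

Turning each retained cluster into a typed pseudo-tool with a docstring would be a further LLM-driven step that this sketch leaves out.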
If this is right
- Induced libraries raise accuracy from 30 to 74 percent on RuleArena NBA, 38 to 68 percent on MuSR team allocation, and 7 to 29 percent on NatPlan meeting planning.
- A single fixed configuration improves over zero-shot Chain-of-Thought on every one of five subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning.
- The libraries match or surpass expert-authored decompositions and outperform AWM at lower average inference cost.
- The method works across narrative deduction, rule application, and planning tasks without task-specific tuning.
- Pseudo-tools remain usable by the same LLM that produced the original traces.
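The headline figures can be checked with a few lines of arithmetic. The accuracy pairs are the ones reported in the abstract; the relative-improvement ratio is an extra derived quantity added here for illustration, and the `reported` and `gains` names are hypothetical.

```python
# (baseline accuracy %, induced-library accuracy %) as reported in the abstract
reported = {
    "RuleArena NBA": (30, 74),
    "MuSR team allocation": (38, 68),
    "NatPlan meeting planning": (7, 29),
}


def gains(results):
    """Return {task: (percentage-point gain, after/before ratio)}."""
    out = {}
    for task, (before, after) in results.items():
        out[task] = (after - before, round(after / before, 2))
    return out


if __name__ == "__main__":
    for task, (pp, ratio) in gains(reported).items():
        print(f"{task}: +{pp}pp, {ratio}x the baseline accuracy")
```

The percentage-point and relative views rank the tasks differently: NatPlan has the smallest absolute gain but the largest ratio, because its baseline is only 7 percent.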
Where Pith is reading between the lines
- Repeated cycles of trace collection and library induction could produce progressively stronger agent libraries without external supervision.
- The same clustering approach might expose reusable reasoning patterns in domains beyond the three benchmarks tested here.
- Because the primitives are stored only as docstrings, they could be inspected or edited by humans to inject domain knowledge.
- The cost reduction relative to AWM suggests the approach may scale to larger trace collections without proportional growth in inference expense.
Load-bearing premise
Recurrent reasoning moves identified by clustering on successful traces will form coherent, generalizable pseudo-tools whose natural-language docstrings can be reliably interpreted and composed by the LLM at test time.
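To make the premise concrete, here is a minimal sketch of what a docstring-specified pseudo-tool could look like, assuming the tool is nothing more than a name, a typed signature, and a docstring that is rendered into a prompt at call time. The `PseudoTool` class, its fields, the prompt wording, and the `llm` callable are all hypothetical; the source does not show the actual library format.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PseudoTool:
    """A tool with no executable body: the docstring is the specification."""
    name: str
    params: Dict[str, str]  # parameter name -> type label, e.g. {"rule": "str"}
    returns: str            # type label of the result
    docstring: str          # natural-language behaviour description

    def render_prompt(self, **kwargs) -> str:
        # Enforce the typed signature at the level of argument names.
        missing = set(self.params) - set(kwargs)
        extra = set(kwargs) - set(self.params)
        if missing or extra:
            raise TypeError(
                f"{self.name}: missing={sorted(missing)} extra={sorted(extra)}"
            )
        args = "\n".join(
            f"{key} ({self.params[key]}): {kwargs[key]}" for key in self.params
        )
        return (
            f"You are executing the tool `{self.name}`.\n"
            f"Specification: {self.docstring}\n"
            f"Arguments:\n{args}\n"
            f"Return a value of type {self.returns}."
        )

    def invoke(self, llm: Callable[[str], str], **kwargs) -> str:
        # The LLM interprets the docstring at invocation time.
        return llm(self.render_prompt(**kwargs))


if __name__ == "__main__":
    tool = PseudoTool(
        name="check_rule_applies",  # illustrative name, not from the paper
        params={"rule": "str", "case": "str"},
        returns="bool",
        docstring="Decide whether the given rule applies to the described case.",
    )
    stub_llm = lambda prompt: "True"  # stand-in for a real model call
    print(tool.invoke(stub_llm, rule="R1", case="C1"))
```

Because the only behavioural contract is prose, whether two invocations of the same tool behave consistently depends entirely on how the model reads the docstring, which is exactly the premise in question.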
What would settle it
The claim would be refuted if applying the induced library to the same tasks produced no accuracy gain, or a drop, relative to the original un-augmented agent on the RuleArena, MuSR, or NatPlan benchmarks.
Original abstract
ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, converts the most frequent moves into a compact library of typed pseudo-tools specified by natural-language docstrings, and composes them via a standard ReAct loop at test time. The central empirical claim is that the induced libraries outperform the trace-generating agent itself (+44pp on RuleArena NBA from 30 to 74, +30pp on MuSR team allocation from 38 to 68, +22pp on NatPlan meeting planning from 7 to 29) while also improving over zero-shot CoT on all five subtasks, matching or surpassing expert-authored decompositions, and outperforming AWM at lower average inference cost.
Significance. If the performance claims and transfer from mined primitives to held-out composition hold under verification, the work would demonstrate a practical route to automatically extracting reusable reasoning components from agent behavior. The single fixed configuration across narrative, rule, and planning domains and the reported cost advantage would be notable strengths for agent design.
major comments (3)
- [§3] §3 (Method): The conversion of clustered trace moves into natural-language docstrings includes no validation, human review, or execution-grounded refinement step; this is load-bearing for the claim that the primitives are reliably interpretable and composable by the base LLM on held-out problems, as clustering on surface similarity alone does not ensure semantic coherence or context-independence.
- [§4] §4 (Experiments) and abstract: The headline gains (+44pp, +30pp, +22pp) and cross-subtask improvements are reported without details on the clustering algorithm, exclusion rules for traces, number of clusters selected, or any ablation on docstring quality; without these, it is unclear whether the results rest on post-hoc choices rather than the single-pass procedure.
- [Table 1] Table 1 or equivalent results table: No error bars, statistical significance tests, or failure-case analysis are referenced for the five subtasks, which undermines the claim that a single configuration reliably outperforms baselines and the original agent across domains.
minor comments (1)
- [Abstract] Abstract: The description of the clustering step could be expanded with one sentence on the similarity metric or algorithm used.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We respond to each major comment below.
- Referee: [§3] §3 (Method): The conversion of clustered trace moves into natural-language docstrings includes no validation, human review, or execution-grounded refinement step; this is load-bearing for the claim that the primitives are reliably interpretable and composable by the base LLM on held-out problems, as clustering on surface similarity alone does not ensure semantic coherence or context-independence.
  Authors: The method is designed as a single-pass automated procedure without human intervention, relying on the empirical success of the induced primitives in the ReAct composition on held-out tasks to demonstrate their interpretability. Clustering is performed on recurrent moves from successful traces, which by construction are context-independent enough to be reused. We will add examples of the generated docstrings and their application in the revised §3 to illustrate semantic coherence.
  Revision: partial
- Referee: [§4] §4 (Experiments) and abstract: The headline gains (+44pp, +30pp, +22pp) and cross-subtask improvements are reported without details on the clustering algorithm, exclusion rules for traces, number of clusters selected, or any ablation on docstring quality; without these, it is unclear whether the results rest on post-hoc choices rather than the single-pass procedure.
  Authors: We will expand §3 and §4 to provide full details on the clustering algorithm, trace selection criteria, and cluster count selection process. We will also include an ablation comparing the full method to variants with different docstring generation approaches to show robustness. The configuration is fixed across all experiments as stated, and these details will confirm it is not post-hoc.
  Revision: yes
- Referee: [Table 1] Table 1 or equivalent results table: No error bars, statistical significance tests, or failure-case analysis are referenced for the five subtasks, which undermines the claim that a single configuration reliably outperforms baselines and the original agent across domains.
  Authors: We agree that additional statistical rigor would strengthen the presentation. In the revision, we will add error bars based on multiple runs, perform significance tests for the reported improvements, and include a short analysis of failure cases to support the reliability of the single configuration.
  Revision: yes
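The kind of statistical reporting requested could look like the sketch below: a Wilson score interval for each accuracy, plus an exact McNemar test for a paired comparison on the same test items. The test-set sizes and per-item outcomes are not given in the source, so `N_HYPOTHETICAL = 100` and any discordant-pair counts are placeholders, and the function names are illustrative. This is one reasonable choice of tests, not necessarily what the authors will use.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def mcnemar_exact(only_baseline: int, only_library: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pairs.

    only_baseline: items solved by the baseline agent but not the library agent.
    only_library:  items solved by the library agent but not the baseline.
    """
    n = only_baseline + only_library
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(only_baseline, only_library)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    N_HYPOTHETICAL = 100  # assumption: the real test-set size is not in the source
    # 30% and 74% are the reported RuleArena NBA accuracies.
    base = wilson_interval(30, N_HYPOTHETICAL)
    lib = wilson_interval(74, N_HYPOTHETICAL)
    print("baseline 95% interval:", base)
    print("library  95% interval:", lib)
    print("intervals overlap:", base[1] >= lib[0])
```

The interval widths depend strongly on the number of test items, so the same 44-point gap would be far less conclusive on a few dozen problems than on several hundred.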
Circularity Check
No significant circularity; empirical extraction on held-out tasks
The paper describes a single-pass mining procedure that clusters recurrent moves from successful ReAct traces, converts them to natural-language docstring pseudo-tools, and evaluates composition on held-out problems. No equations, fitted parameters, or self-citations are presented that would make the reported gains (+44pp, +30pp, +22pp) equivalent to the input traces by construction. The method contains no self-definitional loops, no renaming of known results as novel derivations, and no load-bearing uniqueness theorems imported from the authors' prior work. The central result is an empirical comparison against the trace-generating agent and baselines, which is externally falsifiable on the stated benchmarks and does not reduce to tautology.
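The non-circularity argument rests on the problems used for trace mining being disjoint from the problems used for evaluation. A minimal sketch of one way to enforce that separation is a deterministic hash-based split on problem IDs. The ID format, the `split_of` and `partition` helpers, and the 50/50 fraction are assumptions made for illustration; the source does not describe how the held-out split was constructed.

```python
import hashlib


def split_of(problem_id: str, test_fraction: float = 0.5) -> str:
    """Assign a problem to 'induce' or 'test' based only on its ID."""
    digest = hashlib.sha256(problem_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "induce"


def partition(problem_ids, test_fraction: float = 0.5):
    """Split IDs into disjoint induction and evaluation sets."""
    parts = {"induce": [], "test": []}
    for pid in problem_ids:
        parts[split_of(pid, test_fraction)].append(pid)
    return parts


if __name__ == "__main__":
    ids = [f"problem-{i}" for i in range(20)]  # hypothetical IDs
    parts = partition(ids)
    # Traces would be mined only from parts["induce"]; accuracy would be
    # reported only on parts["test"].
    print(len(parts["induce"]), "for induction,", len(parts["test"]), "held out")
```

Keying the split on the ID rather than on a random seed means a problem can never migrate between the two sets across reruns.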