
arxiv: 2606.02994 · v1 · pith:5HR2OWWAnew · submitted 2026-06-02 · 💻 cs.AI · cs.CL

Inducing Reasoning Primitives from Agent Traces

Pith reviewed 2026-06-28 10:52 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords reasoning primitives · ReAct agents · trace mining · pseudo-tools · agent libraries · induction from traces · LLM composition · planning benchmarks

The pith

Libraries of reasoning primitives induced from agent traces let LLMs outperform the agents that generated those traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Reasoning Primitive Induction, a method that mines successful ReAct traces from LLM agents, clusters the most recurrent reasoning moves, and converts them into a compact library of pseudo-tools. Each pseudo-tool is defined only by a natural-language docstring that the same LLM interprets when the tool is invoked, after which a standard ReAct loop composes the primitives to solve new problems. The induced libraries raise accuracy from 30 to 74 percent on RuleArena NBA, from 38 to 68 percent on MuSR team allocation, and from 7 to 29 percent on NatPlan meeting planning. The same fixed configuration also exceeds zero-shot Chain-of-Thought on each of the five comparable subtasks, matches or beats expert-written decompositions, and outperforms AWM at lower average inference cost. The central claim is therefore that recurrent reasoning patterns visible in successful traces can be extracted once and reused to improve performance on the same class of tasks.

Core claim

Reasoning Primitive Induction mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The induced libraries outperform the very agent that generated their traces by 44, 30, and 22 percentage points on the three main tasks, improve over zero-shot Chain-of-Thought on every subtask, match or surpass expert-authored decompositions, and outperform AWM at lower average inference cost.
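The mine, cluster, and synthesize steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Rollout`, `Primitive`, `induce_library`, `label_move`, and `write_docstring` are hypothetical, and the two callables stand in for the LLM prompts that the paper uses for categorization and synthesis. Only the library size `k` is modeled; the second free parameter `m` mentioned in the Figure 2 caption is left out because this page does not describe its role.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    thoughts: list[str]  # per-step Thought strings from one ReAct rollout
    correct: bool        # whether the rollout reached the right answer


@dataclass
class Primitive:
    name: str
    docstring: str       # natural-language body, interpreted by an LLM later


def induce_library(
    rollouts: list[Rollout],
    label_move: Callable[[str], str],                  # assumed stand-in for an LLM labeling call
    write_docstring: Callable[[str, list[str]], str],  # assumed stand-in for an LLM synthesis call
    k: int,
) -> list[Primitive]:
    # Stage 1: filter for correctness.
    successful = [r for r in rollouts if r.correct]
    # Stage 2: extract per-step Thought strings.
    thoughts = [t for r in successful for t in r.thoughts]
    # Stage 3: group thoughts under a canonical reasoning-move label.
    examples: dict[str, list[str]] = {}
    for thought in thoughts:
        examples.setdefault(label_move(thought), []).append(thought)
    counts = Counter({label: len(ts) for label, ts in examples.items()})
    # Stage 4: turn the k most frequent moves into primitives.
    return [
        Primitive(label, write_docstring(label, examples[label]))
        for label, _ in counts.most_common(k)
    ]


# Toy stand-ins so the sketch runs without a model (keyword rules, not clustering).
def toy_label(thought: str) -> str:
    text = thought.lower()
    if "alibi" in text:
        return "check_alibi"
    if "motive" in text:
        return "assess_motive"
    return "other"


def toy_docstring(label: str, examples: list[str]) -> str:
    return f"Perform the '{label}' reasoning move (seen in {len(examples)} thoughts)."


rollouts = [
    Rollout(["Check the alibi of each suspect.", "Who has a motive?"], True),
    Rollout(["Compare each alibi with the timeline."], True),
    Rollout(["Guess at random."], False),  # dropped by the correctness filter
]
library = induce_library(rollouts, toy_label, toy_docstring, k=1)
```

Passing the labeling and synthesis steps in as callables keeps the control flow testable without a model; swapping in real LLM calls would not change the surrounding code.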

What carries the argument

Reasoning Primitive Induction, which clusters recurrent reasoning moves from successful traces and converts them into pseudo-tools specified by natural-language docstrings.
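To make the idea of a docstring-only tool concrete, here is a small Python sketch of how such a pseudo-tool might be represented and invoked. Everything in it is an assumption for illustration: the `PseudoTool` class, its fields, the prompt layout, and the `fake_llm` stub are hypothetical, and the paper's actual type annotations and prompts are not shown on this page.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PseudoTool:
    name: str
    arg_names: tuple[str, ...]
    returns: str      # return type as plain text (assumed representation)
    docstring: str    # the only "implementation" the tool has

    def signature(self) -> str:
        return f"{self.name}({', '.join(self.arg_names)}) -> {self.returns}"

    def render_prompt(self, **kwargs: str) -> str:
        # Refuse to build a prompt if any declared argument is missing.
        missing = [a for a in self.arg_names if a not in kwargs]
        if missing:
            raise ValueError(f"missing arguments: {missing}")
        lines = [
            f"Tool: {self.signature()}",
            f"Description: {self.docstring}",
            "Inputs:",
        ]
        lines += [f"- {a}: {kwargs[a]}" for a in self.arg_names]
        lines.append("Return only the tool's output.")
        return "\n".join(lines)

    def invoke(self, llm: Callable[[str], str], **kwargs: str) -> str:
        # The LLM realizes the tool body from the docstring at call time.
        return llm(self.render_prompt(**kwargs))


check_alibi = PseudoTool(
    name="check_alibi",
    arg_names=("suspect", "timeline"),
    returns="str",
    docstring="Decide whether the suspect's alibi is consistent with the timeline.",
)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; it only reports how many lines it received.
    return f"received {len(prompt.splitlines())} prompt lines"


answer = check_alibi.invoke(fake_llm, suspect="the gardener", timeline="9pm to 10pm")
```

Because the tool has no executable body, a ReAct loop can call it like any other tool while the model interprets the docstring on each invocation.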

If this is right

  • Induced libraries raise accuracy from 30 to 74 percent on RuleArena NBA, 38 to 68 percent on MuSR team allocation, and 7 to 29 percent on NatPlan meeting planning.
  • A single fixed configuration improves over zero-shot Chain-of-Thought on every one of five subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning.
  • The libraries match or surpass expert-authored decompositions while using lower average inference cost than AWM.
  • The method works across narrative deduction, rule application, and planning tasks without task-specific tuning.
  • Pseudo-tools remain usable by the same LLM that produced the original traces.
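The headline gains are absolute differences in percentage points, not relative improvements. The short Python check below recomputes both from the before and after accuracies quoted above; the variable names are illustrative only.

```python
# (source agent accuracy, induced library accuracy), in percent, as quoted above.
results = {
    "RuleArena NBA": (30, 74),
    "MuSR team allocation": (38, 68),
    "NatPlan meeting planning": (7, 29),
}

# Absolute gain in percentage points.
gains_pp = {task: after - before for task, (before, after) in results.items()}

# Relative improvement as a ratio of accuracies, rounded to two decimals.
ratios = {task: round(after / before, 2) for task, (before, after) in results.items()}

for task in results:
    print(f"{task}: +{gains_pp[task]}pp, {ratios[task]}x the source accuracy")
```

The differences come out to 44, 30, and 22 points, matching the reported gains. The ratios (about 2.47, 1.79, and 4.14) show that the smallest absolute gain, on NatPlan meeting planning, is the largest relative one because its starting accuracy is so low.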

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Repeated cycles of trace collection and library induction could produce progressively stronger agent libraries without external supervision.
  • The same clustering approach might expose reusable reasoning patterns in domains beyond the three benchmarks tested here.
  • Because the primitives are stored only as docstrings, they could be inspected or edited by humans to inject domain knowledge.
  • The cost reduction relative to AWM suggests the approach may scale to larger trace collections without proportional growth in inference expense.

Load-bearing premise

Recurrent reasoning moves identified by clustering on successful traces will form coherent, generalizable pseudo-tools whose natural-language docstrings can be reliably interpreted and composed by the LLM at test time.

What would settle it

Applying the induced library to the same tasks and observing no accuracy gain or a drop relative to the original un-augmented agent on the RuleArena, MuSR, or NatPlan benchmarks.

Figures

Figures reproduced from arXiv: 2606.02994 by Jiarui Yan, Joshua Momo, William W. Cohen, Zhihan Lei.

Figure 1
Figure 1: Worked example of reasoning primitive induction on MuSR murder. (A) Three verbatim Thought strings from one successful ReAct rollout. (B) Each thought is mapped to a reasoning-move label (Algorithm 1); the bar chart shows the top-5 canonical categories. (C) The top category is synthesized into a typed pseudo-tool whose body is realized by an LLM at invocation time (full library in Appendix A). Induction ex…
Figure 2
Figure 2: Reasoning primitive induction. The pipeline (Algorithm 1) consumes a corpus of ReAct rollouts and emits a library L in four stages: filter for correctness, extract per-step Thought strings, categorize and merge them into canonical reasoning moves via two LLM calls, and synthesize the top K as typed primitives. The recipe has two free parameters (K, m), three LLM prompts, and one configuration unchanged acr…
Figure 3
Figure 3: Exceeding the source agent. Test-set accuracy of the source ReAct agent versus the induced library. Induced libraries exceed their source by up to +44pp on RuleArena NBA, +30pp on MuSR team allocation, and +22pp on NatPlan meeting (all paired-∆ CIs strictly positive; Appendix G). NatPlan trip is excluded from method comparisons because all methods score below 10%.
Figure 4
Figure 4: Discovery matches or surpasses expert design. Test-set accuracy of expert-authored decompositions versus the induced library. Induced libraries match or exceed the expert spec on every comparable cell; the team (+17pp) and meeting (+15pp) gains are statistically significant, while murder, object, and NBA are within 95% bootstrap CIs (Appendix G). Bold deltas indicate significance.
Original abstract

ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, converts the most frequent moves into a compact library of typed pseudo-tools specified by natural-language docstrings, and composes them via a standard ReAct loop at test time. The central empirical claim is that the induced libraries outperform the trace-generating agent itself (+44pp on RuleArena NBA from 30 to 74, +30pp on MuSR team allocation from 38 to 68, +22pp on NatPlan meeting planning from 7 to 29) while also improving over zero-shot CoT on all five subtasks, matching or surpassing expert-authored decompositions, and outperforming AWM at lower average inference cost.

Significance. If the performance claims and transfer from mined primitives to held-out composition hold under verification, the work would demonstrate a practical route to automatically extracting reusable reasoning components from agent behavior. The single fixed configuration across narrative, rule, and planning domains and the reported cost advantage would be notable strengths for agent design.

major comments (3)
  1. [§3] §3 (Method): The conversion of clustered trace moves into natural-language docstrings includes no validation, human review, or execution-grounded refinement step; this is load-bearing for the claim that the primitives are reliably interpretable and composable by the base LLM on held-out problems, as clustering on surface similarity alone does not ensure semantic coherence or context-independence.
  2. [§4] §4 (Experiments) and abstract: The headline gains (+44pp, +30pp, +22pp) and cross-subtask improvements are reported without details on the clustering algorithm, exclusion rules for traces, number of clusters selected, or any ablation on docstring quality; without these, it is unclear whether the results rest on post-hoc choices rather than the single-pass procedure.
  3. [Table 1] Table 1 or equivalent results table: No error bars, statistical significance tests, or failure-case analysis are referenced for the five subtasks, which undermines the claim that a single configuration reliably outperforms baselines and the original agent across domains.
minor comments (1)
  1. [Abstract] Abstract: The description of the clustering step could be expanded with one sentence on the similarity metric or algorithm used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We respond to each major comment below.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The conversion of clustered trace moves into natural-language docstrings includes no validation, human review, or execution-grounded refinement step; this is load-bearing for the claim that the primitives are reliably interpretable and composable by the base LLM on held-out problems, as clustering on surface similarity alone does not ensure semantic coherence or context-independence.

    Authors: The method is designed as a single-pass automated procedure without human intervention, relying on the empirical success of the induced primitives in the ReAct composition on held-out tasks to demonstrate their interpretability. Clustering is performed on recurrent moves from successful traces, which by construction are context-independent enough to be reused. We will add examples of the generated docstrings and their application in the revised §3 to illustrate semantic coherence. revision: partial

  2. Referee: [§4] §4 (Experiments) and abstract: The headline gains (+44pp, +30pp, +22pp) and cross-subtask improvements are reported without details on the clustering algorithm, exclusion rules for traces, number of clusters selected, or any ablation on docstring quality; without these, it is unclear whether the results rest on post-hoc choices rather than the single-pass procedure.

    Authors: We will expand §3 and §4 to provide full details on the clustering algorithm, trace selection criteria, and cluster count selection process. We will also include an ablation comparing the full method to variants with different docstring generation approaches to show robustness. The configuration is fixed across all experiments as stated, and these details will confirm it is not post-hoc. revision: yes

  3. Referee: [Table 1] Table 1 or equivalent results table: No error bars, statistical significance tests, or failure-case analysis are referenced for the five subtasks, which undermines the claim that a single configuration reliably outperforms baselines and the original agent across domains.

    Authors: We agree that additional statistical rigor would strengthen the presentation. In the revision, we will add error bars based on multiple runs, perform significance tests for the reported improvements, and include a short analysis of failure cases to support the reliability of the single configuration. revision: yes
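One common way to produce the kind of error bars discussed in this exchange is a paired percentile bootstrap over per-problem outcomes. The Python sketch below is a generic version of that technique, not the procedure from the paper's Appendix G, which is not reproduced on this page. The function name `paired_bootstrap_ci`, its defaults, and the per-item outcomes are all hypothetical; the data are synthetic and do not come from the paper.

```python
import random


def paired_bootstrap_ci(base, new, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference new - base.

    base and new are equal-length lists of 0/1 outcomes on the same test items.
    """
    if len(base) != len(new) or not base:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(base)
    # Pairing: compute the per-item difference once, then resample items.
    diffs = [after - before for before, after in zip(base, new)]
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi


# Synthetic outcomes on 100 shared items (NOT the paper's data):
# the baseline solves 40 items, the new method solves 65.
base = [1] * 40 + [0] * 60
new = [1] * 35 + [0] * 5 + [1] * 30 + [0] * 30

delta, lo, hi = paired_bootstrap_ci(base, new)
print(f"delta = {delta:+.2f}, 95% CI = [{lo:+.2f}, {hi:+.2f}]")
```

Resampling items rather than the two accuracy figures separately preserves the pairing between methods, which is what makes a strictly positive interval informative about a per-problem improvement.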

Circularity Check

0 steps flagged

No significant circularity; empirical extraction on held-out tasks

Full rationale

The paper describes a single-pass mining procedure that clusters recurrent moves from successful ReAct traces, converts them to natural-language docstring pseudo-tools, and evaluates composition on held-out problems. No equations, fitted parameters, or self-citations are presented that would make the reported gains (+44pp, +30pp, +22pp) equivalent to the input traces by construction. The method contains no self-definitional loops, no renaming of known results as novel derivations, and no load-bearing uniqueness theorems imported from the authors' prior work. The central result is an empirical comparison against the trace-generating agent and baselines, which is externally falsifiable on the stated benchmarks and does not reduce to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the approach relies on standard LLM prompting and clustering assumptions that are not enumerated.

pith-pipeline@v0.9.1-grok · 5712 in / 1162 out tokens · 16005 ms · 2026-06-28T10:52:31.461779+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In ICLR, 2025

  2. [2]

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In ICLR, 2023

  3. [3]

    Z. Z. Wang, J. Mao, D. Fried, and G. Neubig. Agent workflow memory. In ICML, 2025

  4. [4]

    Z. Z. Wang, A. Gandhi, G. Neubig, and D. Fried. Inducing programmatic skills for agentic tasks. In COLM, 2025

  5. [5]

    J. Zhang, J. Xiang, Z. Yu, F. Teng, X.-H. Chen, J. Chen, M. Zhuge, X. Cheng, S. Hong, J. Wang, B. Zheng, B. Liu, Y. Luo, and C. Wu. AFlow: Automating agentic workflow generation. In ICLR (Oral), 2025

  6. [6]

    Z. Shi, Y. Zhu, J. Shi, X. Zhang, L. Wang, and C. Miao. STIR: Internalizing LLM reasoning via discovery and replay of latent actions. arXiv preprint arXiv:2602.04925, 2026

  7. [7]

    G. Sarch, L. Jang, M. J. Tarr, W. W. Cohen, K. Marino, and K. Fragkiadaki. VLM agents generate their own memories: Distilling experience into embodied programs of thought. In NeurIPS, 2024

  8. [8]

    P. Zhou, J. Pujara, X. Ren, X. Chen, H.-T. Cheng, Q. V. Le, E. H. Chi, D. Zhou, S. Mishra, and H. S. Zheng. Self-discover: Large language models self-compose reasoning structures. In NeurIPS, 2024

  9. [9]

    O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In ICLR, 2024

  10. [10]

    L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, C. Potts, K. Sen, A. G. Dimakis, I. Stoica, D. Klein, M. Zaharia, and O. Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning. In ICLR (Oral), 2026

  11. [11]

    T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In ICLR, 2023

  12. [12]

    A. Kumar, V. Zhuang, R. Agarwal, Y. Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs, L. M. Zhang, K. McKinney, D. Shrivastava, C. Paduraru, G. Tucker, D. Precup, F. Behbahani, and A. Faust. Training language models to self-correct via reinforcement learning. In ICLR, 2025

  13. [13]

    X. Wang, J. Wei, D. Schuurmans, Q. Le, S. Narang, A. Chowdhery, D. Zhou, and E. H. Chi. Self-consistency improves chain of thought reasoning in language models. In ICLR, 2023

  14. [14]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022

  15. [15]

    W. Chen, X. Ma, X. Wang, and W. W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. TMLR, 2023

  16. [16]

    A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In ICML, 2024

  17. [17]

    Z. Sprague, X. Ye, K. Bostrom, S. Chaudhuri, and G. Durrett. MuSR: Testing the limits of chain-of-thought with multistep soft reasoning. In ICLR, 2024

  18. [18]

    C. A. Cohen and W. W. Cohen. Watch your steps: Observable and modular chains of thought. arXiv preprint arXiv:2409.15359, 2024

  19. [19]

    J. Leng, C. A. Cohen, Z. Zhang, C. Xiong, and W. W. Cohen. Semi-structured LLM reasoners can be rigorously audited. arXiv preprint arXiv:2505.24217, 2025

  20. [20]

    H. S. Zheng, S. Mishra, H. Zhang, X. Chen, M. Chen, A. Nova, L. Hou, H.-T. Cheng, Q. V. Le, E. H. Chi, and D. Zhou. NATURAL PLAN: Benchmarking LLMs on natural language planning. arXiv preprint arXiv:2406.04520, 2024

  21. [21]

    K. Ellis, C. Wong, M. Nye, M. Sablé-Meyer, L. Morales, L. Hewitt, L. Cary, A. Solar-Lezama, and J. B. Tenenbaum. DreamCoder: Bootstrapping inductive program synthesis with wake-sleep library learning. In PLDI, 2021

  22. [22]

    G. Grand, L. Wong, M. Bowers, T. X. Olausson, M. Liu, J. B. Tenenbaum, and J. Andreas. LILO: Learning interpretable libraries by compressing and documenting code. In ICLR, 2024

  23. [23]

    M. Bowers, T. X. Olausson, L. Wong, G. Grand, J. B. Tenenbaum, K. Ellis, and A. Solar-Lezama. Top-down synthesis for library learning. In POPL, 2023

  24. [24]

    G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  25. [25]

    N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023

  26. [26]

    A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, and G. Huang. ExpeL: LLM agents are experiential learners. In AAAI, 2024

  27. [27]

    E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman. STaR: Bootstrapping reasoning with reasoning. In NeurIPS, 2022

  28. [28]

    Y. Qu, A. Singh, Y. Lee, A. Setlur, R. Salakhutdinov, C. Finn, and A. Kumar. RLAD: Training LLMs to discover abstractions for solving reasoning problems. arXiv preprint arXiv:2510.02263, 2025

  29. [29]

    A. Hosseini, X. Yuan, N. Malkin, A. Courville, A. Sordoni, and R. Agarwal. V-STaR: Training verifiers for self-taught reasoners. In COLM, 2024

  30. [30]

    L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K.-W. Lee, and E.-P. Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In ACL, 2023

  31. [31]

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023

  32. [32]

    T. Cai, X. Wang, T. Ma, X. Chen, and D. Zhou. Large language models as tool makers. In ICLR, 2024

  33. [33]

    L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. PAL: Program-aided language models. In ICML, 2023

  34. [34]

    J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. In IEEE ICRA, 2023

  35. [35]

    Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao, E. Wong, M. Apidianaki, and C. Callison-Burch. Faithful chain-of-thought reasoning. In IJCNLP-AACL, 2023

  36. [36]

    A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark. Self-refine: Iterative refinement with self-feedback. In NeurIPS, 2023

  37. [37]

    K. Opsahl-Ong, M. J. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In EMNLP, 2024

  38. [38]

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. In NeurIPS, 2023

  39. [39]

    Z. Wang, D. Fried, and G. Neubig. TroVE: Inducing verifiable and efficient toolboxes for solving programmatic tasks. In ICML, 2024

  40. [40]

    L. Yuan, Y. Chen, X. Wang, Y. R. Fung, H. Peng, and H. Ji. CRAFT: Customizing LLMs by creating and retrieving from specialized toolsets. In ICLR, 2024

  41. [41]

    B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srinivasa, G. Liu, G. Neubig, and Y. Su. SkillWeaver: Web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079, 2025

  42. [42]

    R. Zhou, W. Hua, L. Pan, S. Cheng, X. Wu, E. Yu, and W. Y. Wang. RuleArena: A benchmark for rule-guided reasoning with LLMs in real-world scenarios. In ACL, 2025