
arxiv: 2606.00189 · v1 · pith:KB6EGWOJnew · submitted 2026-05-29 · 💻 cs.LG · cs.AI

Learning to Construct Practical Agentic Systems

Pith reviewed 2026-06-28 23:02 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords agentic systems · LLM agents · pseudo-tools · fixed workflows · multi-objective optimization · modularity · cost optimization · workflow learning

The pith

In a framework that uses pseudo-tools to enforce modularity, fixed workflows for LLM agents are generally cheaper and more accurate than dynamic planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an agent framework that defines pseudo-tools to call LLMs recursively on restricted contexts, enforcing modularity in agentic systems. Using this setup, hand-engineered fixed workflows for diverse tasks prove generally cheaper and more accurate than dynamically planned workflows. Novel learning methods for pseudo-tools and fixed workflows then outperform the hand-engineered versions. The same modularity supports multi-objective optimization that jointly tunes inference cost and response quality.

Core claim

By defining pseudo-tools that call LLMs recursively on a restricted context, the framework enables designers to enforce modularity in agentic LLM systems. Hand-constructed fixed workflows built in this framework are generally cheaper and more accurate than dynamically-planned workflows across diverse tasks. Learning methods for the required pseudo-tools and workflows outperform the hand-engineered agents, and the modularity permits multi-objective optimization to balance cost against quality.

What carries the argument

The pseudo-tool mechanism, which enforces modularity by restricting the context passed to recursive LLM calls.
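To make the idea of a restricted context concrete, here is a minimal sketch of what a pseudo-tool could look like. It is not the paper's implementation: the class name `PseudoTool`, the `visible_keys` whitelist, and the `echo_llm` stand-in are all hypothetical names chosen for illustration, and the "LLM" is a plain function so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class PseudoTool:
    """Illustrative pseudo-tool: an LLM call that sees only whitelisted context keys."""
    name: str
    instruction: str
    visible_keys: Sequence[str]
    llm: LLM

    def build_prompt(self, context: Dict[str, str]) -> str:
        # Restrict the context: keys outside visible_keys never reach the model.
        restricted = {k: context[k] for k in self.visible_keys if k in context}
        lines = [f"{k}: {v}" for k, v in restricted.items()]
        return self.instruction + "\n" + "\n".join(lines)

    def __call__(self, context: Dict[str, str]) -> str:
        return self.llm(self.build_prompt(context))


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model (an assumption); it simply returns the prompt."""
    return prompt


extract_stats = PseudoTool(
    name="extract_stats",
    instruction="List the numeric facts in the passage.",
    visible_keys=["passage"],
    llm=echo_llm,
)

context = {
    "passage": "Team A scored 101 points.",
    "chat_history": "unrelated earlier turns",
}
output = extract_stats(context)
```

Because `llm` is just a callable in this sketch, another agent loop could be passed in its place to obtain a recursive call; whether the paper structures recursion this way is not stated in the material summarized here.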

If this is right

  • Hand-constructed fixed workflows outperform dynamically-planned workflows on both cost and accuracy for the tested tasks.
  • Learning methods for pseudo-tools and workflows produce agents that beat hand-engineered baselines.
  • Multi-objective optimization can be applied directly to the modular components to trade off inference cost against response quality.
  • Production agentic systems can prioritize simplicity and predictability by favoring fixed workflows over dynamic planning.
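The predictability argument in the points above can be illustrated with a small sketch of a fixed workflow, in which the sequence of steps is decided in advance and the number of model calls is therefore known before running. Everything here is an assumption made for illustration: `run_fixed_workflow`, `CallCounter`, and the two placeholder steps are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List

State = Dict[str, str]
Step = Callable[[State], State]  # each step stands in for one pseudo-tool call


class CallCounter:
    """Counts step invocations so the cost of a run can be inspected afterwards."""

    def __init__(self) -> None:
        self.calls = 0

    def wrap(self, fn: Step) -> Step:
        def counted(state: State) -> State:
            self.calls += 1
            return fn(state)
        return counted


def run_fixed_workflow(steps: List[Step], state: State) -> State:
    """Run the steps in a fixed order; the number of calls is always len(steps)."""
    for step in steps:
        # Merge each step's output into the shared state.
        state = {**state, **step(state)}
    return state


# Placeholder steps (assumptions); a real system would call an LLM inside each one.
def extract(state: State) -> State:
    return {"facts": state["question"].upper()}


def answer(state: State) -> State:
    return {"answer": f"based on {state['facts']}"}


counter = CallCounter()
steps = [counter.wrap(extract), counter.wrap(answer)]
result = run_fixed_workflow(steps, {"question": "who won"})
```

A dynamically planned agent would instead choose its next step inside a loop, so the call count, and with it the inference cost, would vary from input to input.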

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modularity could make it easier to update or debug individual agent components without retraining the whole system.
  • If context restriction works reliably, similar designs might reduce variance in real-world deployment costs.
  • The approach suggests that learning fixed structures may be more sample-efficient than learning full dynamic planners.

Load-bearing premise

The pseudo-tool mechanism successfully enforces useful modularity and restricted context without losing necessary information or introducing new failure modes that offset the gains in cost and accuracy.

What would settle it

A direct comparison on the paper's task set where fixed workflows show higher total cost or lower accuracy than dynamic planning once context restriction is applied.
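One way such a comparison could be scored is sketched below. The decision rule, namely that a fixed workflow "wins" a benchmark only if its mean accuracy exceeds the dynamic planner's by more than the combined standard error, is an assumption of this sketch and not the paper's protocol. The per-case scores are made up for illustration and are not results from the paper.

```python
import math
from statistics import mean, stdev
from typing import Dict, List, Tuple


def mean_and_sem(scores: List[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean (sample stdev / sqrt(n))."""
    return mean(scores), stdev(scores) / math.sqrt(len(scores))


def fixed_wins(
    fixed: Dict[str, List[float]], dynamic: Dict[str, List[float]]
) -> Dict[str, bool]:
    """True where the fixed mean beats the dynamic mean by more than the combined SEM."""
    verdict = {}
    for bench in fixed:
        f_mean, f_sem = mean_and_sem(fixed[bench])
        d_mean, d_sem = mean_and_sem(dynamic[bench])
        verdict[bench] = (f_mean - d_mean) > math.hypot(f_sem, d_sem)
    return verdict


# Hypothetical per-case correctness (1.0 = correct); not data from the paper.
fixed = {
    "bench_x": [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    "bench_y": [1.0, 0.0, 1.0, 0.0],
}
dynamic = {
    "bench_x": [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "bench_y": [1.0, 0.0, 0.0, 1.0],
}
verdict = fixed_wins(fixed, dynamic)
```

The same function could be applied to per-case cost with the inequality reversed, since lower cost is better.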

Figures

Figures reproduced from arXiv: 2606.00189 by Aditya Kumar, Arthur Kajiyama, Cassandra A. Cohen, Jerry Yan, Joshua W. Momo, Lauhitya Reddy, Rafael Enrique Cabrera Jimenez, William W. Cohen, Zhihan Lei.

Figure 1
Figure 1: Left, the accuracy of engineered static workflows compared to ReAct; right, accuracy of workflows compared to a traditional zero-shot tuned prompt. Each box is a benchmark, where the y-axis position is mean workflow accuracy and the x-axis position is ReAct, so boxes above the line y = x show that workflow performs best. Box width is the standard error of the mean.
Figure 2
Figure 2: Left, test split cost vs accuracy of ptool-distill under the Claude Opus 4.6 Anthropic [2026] learner; right, under Gemini 3.1 Pro Preview. The lighter circle is the original hand-written workflow, the darker square is the same workflow after all ptools (excluding the root interfaces) have been processed by code distillation. Arrows pointing up-and-left are lower cost and higher accuracy. Benchmarks means do…
Figure 3
Figure 3: Pareto-optimal configurations found for two benchmarks by the NSGA-II optimized. Left: For NBA, green points are engineered workflows with a single LLM; config 0001 switched between DeepSeek-V3 and Gemini 2.5 Flash-Lite Comanici et al. [2025], using induced ptools and ReAct; configs 011, 005 are the engineered baseline with ptools routed to a gpt-oss-20b OpenAI [2025] model; config 009 is the same but only…
Figure 1
Figure 1: B.2 Pseudo-tool induction experiments. For each subtask we evaluated two action spaces under the same ReAct framework (react_pydantic, a clean ReAct preamble over a pydantic-ai loop), the same backbone model (DeepSeek-V3), and the same test split (seed 42): engineered, the hand-written multi-step pipeline, …
Figure 4
Figure 4: NSGA-II Pareto frontier on MuSR/object_placements (cost on the validation split, USD per …
Figure 5
Figure 5: RuleArena/NBA: engineered six-model react_learned baseline (green) versus the NSGA-II frontier (blue); validation cost (USD/case, linear axis) versus correctness. Squares mark frontier-optimal, diamonds mark dominated. Point labels show each configuration’s top-level method and model assignment(s). NSGA-II recovers the high-accuracy DeepSeek-V3 baseline (top-right) and finds two cheaper frontier configurat…
Original abstract

Automated design and optimization of agentic LLM-based systems leads to sophisticated systems that substantially improve result quality over off-the-shelf agentic patterns. However, studies of fielded agentic systems show that production systems focus much more on issues such as simplicity, controllability, and predictability of inference costs. In this paper we propose principled approaches to designing and optimizing practical agentic systems. We describe an agent framework that enables designers to enforce modularity in agentic systems, by defining "pseudo-tools" that call LLMs recursively on a restricted context. Using this framework we hand-engineer agents for a diverse set of tasks, and show that relative to dynamically-planned workflows, hand-constructed fixed workflows are generally cheaper and more accurate. We then propose novel learning methods for the agentic components required by this framework, namely pseudo-tools and fixed workflows. These learning methods generally outperform hand-engineered agents. We also exploit the modularity of the framework to apply multi-objective optimization methods to jointly optimize cost and response quality and blend the results of multiple learning systems.
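Jointly optimizing cost and response quality means keeping every configuration that is not beaten on both objectives at once. The sketch below shows only that non-dominance filter. It is not NSGA-II, which the figure captions name as the optimizer used, and the `Config` type and all numbers are hypothetical values for illustration, not results from the paper.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    cost_usd: float  # lower is better
    accuracy: float  # higher is better


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    no_worse = a.cost_usd <= b.cost_usd and a.accuracy >= b.accuracy
    strictly_better = a.cost_usd < b.cost_usd or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up numbers for illustration only.
candidates = [
    Config("A", cost_usd=0.010, accuracy=0.80),
    Config("B", cost_usd=0.004, accuracy=0.78),
    Config("C", cost_usd=0.012, accuracy=0.79),  # dominated by A
    Config("D", cost_usd=0.002, accuracy=0.60),
]
front = pareto_front(candidates)
```

Here configuration C drops out because A is both cheaper and more accurate, while A, B, and D each represent a different trade-off and all remain on the front.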

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an agent framework using 'pseudo-tools' for enforcing modularity in LLM-based agentic systems through recursive calls on restricted contexts. It shows that hand-constructed fixed workflows are generally cheaper and more accurate than dynamically-planned workflows, introduces learning methods for pseudo-tools and fixed workflows that outperform hand-engineered agents, and applies multi-objective optimization to jointly optimize cost and response quality.

Significance. If the empirical findings hold, this work addresses important practical issues in agentic systems such as controllability, predictability of costs, and simplicity, which are prioritized in production over sophisticated but complex designs. The combination of hand-engineering, learning, and multi-objective optimization provides a principled approach that could influence how agentic systems are built and optimized in practice.

major comments (2)
  1. [Abstract] The abstract asserts empirical superiority of fixed workflows and learned components but provides no quantitative results, baselines, task definitions, or statistical details, making it impossible to assess the validity of the central claims without the full experimental section.
  2. [Pseudo-tool mechanism] The framework's promise hinges on pseudo-tools delivering modularity and lower cost without losing critical information. No ablation on context restriction, information retention metrics, or comparison with unrestricted context is mentioned to confirm that the restriction does not introduce offsetting failure modes.
minor comments (1)
  1. The abstract could benefit from a brief mention of the specific tasks or domains used in the experiments to provide context for the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review. We address each major comment below, agreeing where the manuscript can be strengthened through revision.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts empirical superiority of fixed workflows and learned components but provides no quantitative results, baselines, task definitions, or statistical details, making it impossible to assess the validity of the central claims without the full experimental section.

    Authors: We agree that the abstract would be more informative with quantitative highlights. The body of the paper contains the requested details on tasks, baselines, metrics, and statistics. In revision we will expand the abstract to include key results such as the observed accuracy gains and cost reductions for fixed workflows versus dynamic planning. revision: yes

  2. Referee: [Pseudo-tool mechanism] The framework's promise hinges on pseudo-tools delivering modularity and lower cost without losing critical information. No ablation on context restriction, information retention metrics, or comparison with unrestricted context is mentioned to confirm that the restriction does not introduce offsetting failure modes.

    Authors: The current evaluation focuses on end-to-end task performance. We acknowledge that targeted ablations would better isolate the effect of context restriction. In the revised manuscript we will add an ablation comparing restricted versus unrestricted contexts together with information-retention metrics. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper describes an empirical framework using pseudo-tools for modularity in agentic LLM systems, hand-engineered agents, and learning methods for pseudo-tools and workflows. Its claims rest on experimental comparisons of cost and accuracy against dynamic planning, together with multi-objective optimization. No equations, derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described content. The claims are checked against external benchmarks through empirical results rather than obtained by internal algebraic reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the introduction of the pseudo-tool abstraction and the empirical premise that fixed modular workflows are preferable for production use. No numerical free parameters are mentioned. The pseudo-tool itself is an invented modeling construct.

axioms (1)
  • domain assumption: LLMs can usefully operate on deliberately restricted contexts without catastrophic loss of necessary information.
    Required for the pseudo-tool recursion to preserve task performance while enforcing modularity.
invented entities (1)
  • pseudo-tool (no independent evidence)
    purpose: A callable that invokes an LLM recursively on a restricted context to enforce modularity in agent workflows.
    New construct introduced to enable the framework's modularity guarantees.

pith-pipeline@v0.9.1-grok · 5735 in / 1427 out tokens · 23224 ms · 2026-06-28T23:02:51.188628+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 5 canonical work pages · 4 internal anchors

  1. Gptswarm: Language agents as optimizable graphs. Forty-first International Conference on Machine Learning.

  2. Recursive language models. arXiv preprint arXiv:2512.24601.

  3. Agentic AI for Emergency Response and Comparative Analysis of SmolAgents, LangGraph, AutoGen, Agno AGI and CrewAI for Crisis Solution. Authorea Preprints.

  4. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588.

  5. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems. 2025.

  6. A comparative study on using genetic algorithm with DEAP framework on different optimization problems. 2024 7th International Conference on Contemporary Computing and Informatics (IC3I). 2024.

  7. Designbench: A comprehensive benchmark for mllm-based front-end code generation. arXiv preprint arXiv:2506.06251, 2025.

  8. Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents. 2026.

  9. Measuring Agents in Production. 2026.

  10. Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan.

  11. Wang, Xingyao and Chen, Yangyi and Yuan, Lifan and Zhang, Yizhe and Li, Yunzhu and Peng, Hao and Ji, Heng. Executable Code Actions Elicit Better

  12. Wu, Qingyun and Bansal, Gagan and Zhang, Jieyu and Wu, Yiran and Li, Beibin and Zhu, Erkang and Jiang, Li and Zhang, Xiaoyun and Zhang, Shaokun and Liu, Jiale and Awadallah, Ahmed Hassan and White, Ryen W. and Burger, Doug and Wang, Chi.

  13. Zhuge, Mingchen and Wang, Wenyi and Kirsch, Louis and Faccio, Francesco and Khizbullin, Dmitrii and Schmidhuber, J. International Conference on Machine Learning (ICML).

  14. Zhang, Jinhao and Xiang, Dian and Yu, Ao and others.

  15. Automated Design of Agentic Systems. arXiv preprint arXiv:2408.08435.

  16. Khattab, Omar and Singhvi, Arnav and Maheshwari, Paridhi and Zhang, Zhiyuan and Santhanam, Keshav and Vardhamanan, Sri and Haq, Saiful and Sharma, Ashutosh and Joshi, Thomas T. and Mober, Hanna and Shah, Pawan Kumar and Edalati, Neel and Lee, Caleb and Shin, Richard and Potts, Christopher and Zaharia, Matei.

  17. Yuksekgonul, Mert and Bianchi, Federico and Boen, Joseph and Liu, Sheng and Huang, Zhi and Guestrin, Carlos and Zou, James.

  18. Ye, Hongqiu and others.

  19. Suzgun, Mirac and Scales, Nathan and Sch. Challenging. Findings of the Association for Computational Linguistics (ACL Findings).

  20. Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning. International Conference on Learning Representations (ICLR).

  21. Chen, Zhiyu and Chen, Wenhu and Smiley, Charese and Shah, Sameena and Borova, Iana and Langdon, Dylan and Moussa, Reema and Beane, Matt and Huang, Ting-Hao and Routledge, Bryan and Wang, William Yang.

  22. Zhou, Ruiwen and Hua, Wenyue and Pan, Liangming and Cheng, Sitao and Wu, Xiaobao and Yu, En and Wang, William Yang.

  23. Sprague, Zayne and Ye, Xi and Bostrom, Kaj and Chaudhuri, Swarat and Durrett, Greg.

  24. Zheng, Huaixiu Steven and Mishra, Swaroop and Zhang, Hugh and Chen, Xinyun and Chen, Minmin and Nova, Azade and Hou, Le and Cheng, Heng-Tze and Le, Quoc V. and Chi, Ed H. and Zhou, Denny.

  25. Khandekar, Nikhil and Jin, Qiao and Xiong, Guangzhi and Dunn, Soren and Applebaum, Serina S. and Anwar, Zain and Sarfo-Gyamfi, Maame and Safranek, Conrad W. and Anwar, Abid A. and Zhang, Andrew and Gilson, Aidan and Singer, Maxwell B. and Dave, Amisha and Taylor, Andrew and Zhang, Aidong and Chen, Qingyu and Lu, Zhiyong.

  26. Jiang, Yixing and Black, Kameron C. and Geng, Gloria and Park, Danny and Zou, James and Ng, Andrew Y. and Chen, Jonathan H. 2025.

  27. Yao, Shunyu and Shinn, Noah and Razavi, Pedram and Narasimhan, Karthik.

  28. Agent Workflow Memory. International Conference on Machine Learning (ICML).

  29. Inducing Programmatic Skills for Agentic Tasks. Conference on Language Modeling (COLM).

  30. Zhao, Andrew and Huang, Daniel and Xu, Quentin and Lin, Matthieu and Liu, Yong-Jin and Huang, Gao.

  31. Wang, Zhiruo and Fried, Daniel and Neubig, Graham.

  32. Large Language Models as Tool Makers. International Conference on Learning Representations (ICLR).

  33. Comanici, Gheorghe and Bieber, Eric and Schaekermann, Mike and Pasupat, Ice and others. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. eprint 2507.06261.

  34. 2026.

  35. F. Journal of Machine Learning Research.

  36. Deb, Kalyanmoy and Pratap, Amrit and Agarwal, Sameer and Meyarivan, T. IEEE Transactions on Evolutionary Computation.

  37. 2023.

  38. 2024.

  39. Roucher, Aymeric and Villanova del Moral, Albert and Wolf, Thomas and von Werra, Leandro and Kaunism. 2025.

  40. 2026.