pith. sign in

arxiv: 2606.12563 · v1 · pith:Z4JL2VPQnew · submitted 2026-06-10 · 💻 cs.AI

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Pith reviewed 2026-06-27 09:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-agent frameworktree searchshared working memoryautonomous agentsLLM inference optimizationOrchestrator-Critic architecturechecks-and-balances
0
0 comments X

The pith

Arbor maintains an explicit search tree of scored hypotheses as shared working memory so multi-agent systems can run stable, multi-day optimization in complex spaces like full-stack LLM inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Arbor to give autonomous agents a cognition layer built on structured tree search rather than isolated stateless actions. Agents share an evolving tree of scored hypotheses that incorporates every measurement, converts failures into signals for redirecting exploration, and grows as successes change the bottlenecks. An Orchestrator delegates work to domain specialists while a Critic performs validation and root-cause checks, creating a checks-and-balances structure that neither can override alone. This setup supports fully autonomous campaigns lasting multiple days on LLM inference optimization, where a lone agent plateaus quickly and then crashes. The result is reported as substantially larger throughput-latency gains than vendor baselines while remaining reproducible across hardware generations.

Core claim

Arbor maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, treating failures as diagnostic signals that reshape subsequent exploration, and expanding as prior successes shift the bottleneck distribution; it pairs an Orchestrator that drives optimization by delegating to Domain Specialists with a Critic that safeguards stability through root-cause analysis and measurement validation, enabling fully autonomous multi-day campaigns on full-stack LLM inference optimization.

What carries the argument

Explicit search tree of scored hypotheses maintained as shared working memory across agents, combined with Orchestrator-Critic checks-and-balances architecture.

If this is right

  • Delivers up to 193% inference throughput-latency Pareto improvement over vendor-optimized baselines.
  • Prevents the irrecoverable crashes that occur in single-agent runs within hours.
  • Generalizes across multiple generations of hardware platforms with run-to-run variance within 2 percentage points.
  • Allows agent capabilities to be decomposed into hard domain skills and soft coordination protocols that compose reliably.
  • Supports treating the search tree as evolving collective memory that shifts focus as bottlenecks move.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tree-as-memory structure could be tested on other stateful engineering optimization tasks outside LLM stacks.
  • Adding more specialized agents might extend the checks-and-balances pattern without losing the stability observed with two agents.
  • Low run-to-run variance suggests the method could support standardized autonomous pipelines that require minimal human oversight after setup.

Load-bearing premise

That an explicit shared search tree of scored hypotheses plus the dual-agent checks-and-balances will consistently convert failures into diagnostic signals that keep the campaign stable rather than letting errors accumulate into collapse.

What would settle it

A side-by-side multi-day run on the same LLM inference optimization task in which the single-agent baseline without the tree crashes irrecoverably while Arbor continues to improve and remains stable.

Figures

Figures reproduced from arXiv: 2606.12563 by Chaojun Hou, Emad Barsoum, Huasha Zhao, Mou Li, Neha Prakriya, Xi Zhao, Zheng Gong, Zhenyu Gu.

Figure 1
Figure 1. Figure 1: Arbor system architecture. The Orchestrator (left) drives the search loop: profiling the [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Search tree for gpt-oss-120b (MoE 120B, MXFP4, AMD MI355X). The system explores [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Throughput vs. interactivity on AMD MI355X for six production models. Interactivity is [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Ablation on gpt-oss-120b (MXFP4, MI355X). Each curve removes one component; all [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Multi-agent communication during iterative diagnosis on one of the model optimizations [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Throughput vs. interactivity on AMD MI300X for 3 production models. Interactivity is [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Run-to-run variance across independent optimization campaigns. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
read the original abstract

Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems operate on isolated targets with stateless evaluation. Arbor instead maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, treating failures as diagnostic signal that reshapes subsequent exploration, and expanding as prior successes shift the bottleneck distribution. We validate Arbor on full-stack LLM inference optimization, a domain where achieving peak performance has historically required coordinated effort from engineering teams across the application, framework, compiler, kernel, and hardware stack. Arbor pairs an Orchestrator agent, which drives optimization by delegating to Domain Specialists across the inference stack, with a Critic agent that safeguards stability through root-cause analysis, introspection, and measurement validation -- a checks-and-balances architecture where neither agent can unilaterally drive the system. Agent capabilities are decomposed into hard skills (domain expertise) and soft skills (coordination protocols that determine how contributions compose), enabling fully autonomous multi-day campaigns. Arbor achieves up to 193% inference throughput-latency Pareto improvement over vendor-optimized baselines, while a single agent without the harness plateaus at +33% throughput improvement and crashes irrecoverably within hours. Arbor generalizes to multiple generations of hardware platform, and run-to-run variance is within 2 percentage points demonstrating that the method is hardware-agnostic and reproducible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Arbor, a multi-agent framework that uses structured tree search as a cognition layer for autonomous agents in large stateful action spaces. It describes an Orchestrator agent delegating to domain specialists paired with a Critic agent for stability checks, applied to full-stack LLM inference optimization. The central claims are up to 193% inference throughput-latency Pareto improvement over vendor baselines, multi-day campaign stability via shared scored-hypothesis trees and failure-as-diagnostic signals, generalization across hardware generations, and contrast with a single-agent baseline that plateaus at +33% and crashes irrecoverably within hours.

Significance. If the empirical claims hold under scrutiny, the work could be significant for autonomous agent research by showing how explicit search trees as shared memory plus dual-agent checks-and-balances can sustain stable, long-running optimization campaigns in complex domains. The reported low run-to-run variance and hardware-agnostic results would be a notable strength if substantiated.

major comments (2)
  1. [Abstract] Abstract: The performance and stability claims (193% Pareto improvement, single-agent crash within hours, multi-day stability via tree + Orchestrator-Critic) are asserted without any experimental protocol, benchmark descriptions, scoring function, failure-handling rules, measurement methodology, or dataset details. This gap is load-bearing because the central contrast between the multi-agent harness and the crashing baseline cannot be evaluated from the provided text.
  2. [Abstract] Abstract: No error bars, run counts, or validation steps are supplied for the throughput-latency numbers or the 2-percentage-point variance claim, preventing assessment of whether the reported outcomes follow from the architecture or from unstated implementation choices.
minor comments (1)
  1. The manuscript would benefit from an explicit experimental section (or appendix) detailing the optimization targets, tree scoring, and stability metrics to make the claims verifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. The concerns about missing experimental details and statistical reporting are valid for a self-contained abstract, and we will revise accordingly while preserving conciseness.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The performance and stability claims (193% Pareto improvement, single-agent crash within hours, multi-day stability via tree + Orchestrator-Critic) are asserted without any experimental protocol, benchmark descriptions, scoring function, failure-handling rules, measurement methodology, or dataset details. This gap is load-bearing because the central contrast between the multi-agent harness and the crashing baseline cannot be evaluated from the provided text.

    Authors: We agree the abstract omits these elements due to length limits. The full manuscript contains sections detailing the benchmark (full-stack LLM inference optimization), scoring (throughput-latency Pareto frontier), failure handling (Critic-driven root-cause analysis treating failures as diagnostic signals), and measurement protocol. To address the load-bearing gap, we will expand the abstract with a concise description of the domain, baseline, tree-based shared memory, and Orchestrator-Critic checks-and-balances. revision: yes

  2. Referee: [Abstract] Abstract: No error bars, run counts, or validation steps are supplied for the throughput-latency numbers or the 2-percentage-point variance claim, preventing assessment of whether the reported outcomes follow from the architecture or from unstated implementation choices.

    Authors: The reported 2-percentage-point variance is based on repeated multi-day campaigns across hardware generations, but the abstract does not include run counts or error bars. We will revise the abstract to state the number of independent runs and observed variance, and ensure the methods section explicitly describes the validation protocol so readers can assess reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical system description without derivations

full rationale

The paper is a purely empirical description of a multi-agent framework (Arbor) that uses tree search as shared memory and an Orchestrator-Critic architecture. No equations, derivations, fitted parameters, or first-principles predictions appear in the abstract or full text. Performance claims (193% Pareto improvement, single-agent crash) are presented as measured experimental outcomes rather than results derived from any internal model or self-referential input. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way. The work is therefore self-contained against external benchmarks with no reduction of claims to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Based solely on the abstract; no numerical free parameters, mathematical axioms, or independently evidenced invented entities are specified. The Orchestrator and Critic roles and the tree-search memory are introduced as part of the framework.

invented entities (2)
  • Orchestrator agent no independent evidence
    purpose: Drives optimization by delegating tasks to Domain Specialists across the inference stack
    Core component of the multi-agent architecture described in the abstract.
  • Critic agent no independent evidence
    purpose: Safeguards stability via root-cause analysis, introspection, and measurement validation
    Core component of the checks-and-balances architecture described in the abstract.

pith-pipeline@v0.9.1-grok · 5807 in / 1405 out tokens · 37838 ms · 2026-06-27T09:58:42.886480+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 7 canonical work pages

  1. [1]

    SWE-bench: Can Language Models Resolve Real-world Github Issues? , url =

    Jimenez, Carlos E and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , booktitle =. SWE-bench: Can Language Models Resolve Real-world Github Issues? , url =

  2. [2]

    2024 , url=

    John Yang and Carlos E Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik R Narasimhan and Ofir Press , booktitle=. 2024 , url=

  3. [3]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents , url =

    Wang, Xingyao and Li, Boxuan and Song, Yufan and Xu, Frank F and Tang, Xiangru and Zhuge, Mingchen and Pan, Jiayi and Song, Yueqi and Li, Bowen and Singh, Jaskirat and Tran, Hoang and Li, Fuqiang and Ma, Ren and Zheng, Mingzhang and Qian, Bill and Shao, Daniel and Muennighoff, Niklas and Zhang, Yizhe and Hui, Binyuan and Lin, Junyang and Brennan, Robert a...

  4. [4]

    2026 , url=

    Bing Xu, Terry Chen, Fengzhe Zhou, Tianqi Chen, Yangqing Jia, Vinod Grover, Haicheng Wu, Wei Liu, Craig Wittenbrink, Wen-mei Hwu, Roger Bringmann, Ming-Yu Liu, Luis Ceze, Michael Lightstone, Humphrey Shi , journal=. 2026 , url=

  5. [5]

    Pawan Kumar, Emilien Dupont, Francisco J

    Romera-Paredes, Bernardino and Barekatain, Mohammadamin and Novikov, Alexander and Balog, Matej and Kumar, M. Pawan and Dupont, Emilien and Ruiz, Francisco J. R. and Ellenberg, Jordan S. and Wang, Pengming and Fawzi, Omar and Kohli, Pushmeet and Fawzi, Alhussein , title=. Nature , year=. doi:10.1038/s41586-023-06924-6 , url=

  6. [6]

    2025 , eprint=

    AlphaEvolve: A coding agent for scientific and algorithmic discovery , author=. 2025 , eprint=

  7. [7]

    2026 , eprint=

    AVO: Agentic Variation Operators for Autonomous Evolutionary Search , author=. 2026 , eprint=

  8. [8]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness , url =

    Dao, Tri and Fu, Dan and Ermon, Stefano and Rudra, Atri and R\'. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness , url =. Advances in Neural Information Processing Systems , editor =

  9. [9]

    2026 , eprint=

    FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling , author=. 2026 , eprint=

  10. [10]

    Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation , pages =

    Chen, Tianqi and Moreau, Thierry and Jiang, Ziheng and Zheng, Lianmin and Yan, Eddie and Cowan, Meghan and Shen, Haichen and Wang, Leyuan and Hu, Yuwei and Ceze, Luis and Guestrin, Carlos and Krishnamurthy, Arvind , title =. Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation , pages =. 2018 , isbn =

  11. [11]

    and Stoica, Ion , title =

    Zheng, Lianmin and Jia, Chengfan and Sun, Minmin and Wu, Zhao and Yu, Cody Hao and Haj-Ali, Ameer and Wang, Yida and Yang, Jun and Zhuo, Danyang and Sen, Koushik and Gonzalez, Joseph E. and Stoica, Ion , title =. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation , articleno =. 2020 , isbn =

  12. [12]

    and Cao, Yuan and Narasimhan, Karthik , title =

    Yao, Shunyu and Yu, Dian and Zhao, Jeffrey and Shafran, Izhak and Griffiths, Thomas L. and Cao, Yuan and Narasimhan, Karthik , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =

  13. [13]

    2023 , html =

    Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , booktitle =. 2023 , html =

  14. [14]

    Tillet, Philippe and Kung, H. T. and Cox, David , title =. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages , pages =. 2019 , isbn =. doi:10.1145/3315508.3329973 , abstract =

  15. [15]

    Efficient memory management for large language model serving with PagedAttention, in: Proceed- ings of the 29th ACM Symposium on Operating Systems Principles, pp

    Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph and Zhang, Hao and Stoica, Ion , title =. Proceedings of the 29th Symposium on Operating Systems Principles , pages =. 2023 , isbn =. doi:10.1145/3600006.3613165 , abstract =

  16. [16]

    and Barrett, Clark and Sheng, Ying , title =

    Zheng, Lianmin and Yin, Liangsheng and Xie, Zhiqiang and Sun, Chuyue and Huang, Jeff and Yu, Cody Hao and Cao, Shiyi and Kozyrakis, Christos and Stoica, Ion and Gonzalez, Joseph E. and Barrett, Clark and Sheng, Ying , title =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =. 2024 , isbn =

  17. [17]

    Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , year =

    Ragan-Kelley, Jonathan and Barnes, Connelly and Adams, Andrew and Paris, Sylvain and Durand, Fr\'. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , year =. Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation , pages =. doi:10.1145/2491956.24...

  18. [18]

    Evolution Through Large Models

    Lehman, Joel and Gordon, Jonathan and Jain, Shawn and Ndousse, Kamal and Yeh, Cathy and Stanley, Kenneth O. Evolution Through Large Models. Handbook of Evolutionary Machine Learning. 2024. doi:10.1007/978-981-99-3814-8_11

  19. [19]

    Learning to optimize halide with tree search and random programs , year =

    Adams, Andrew and Ma, Karima and Anderson, Luke and Baghdadi, Riyadh and Li, Tzu-Mao and Gharbi, Micha\". Learning to optimize halide with tree search and random programs , year =. ACM Trans. Graph. , month = jul, articleno =. doi:10.1145/3306346.3322967 , abstract =

  20. [20]

    Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems , pages =

    Schkufza, Eric and Sharma, Rahul and Aiken, Alex , title =. Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems , pages =. 2013 , isbn =. doi:10.1145/2451116.2451150 , abstract =

  21. [21]

    International Conference on Learning Representations , year=

    Neural Architecture Search with Reinforcement Learning , author=. International Conference on Learning Representations , year=

  22. [22]

    2019 , url=

    Hanxiao Liu and Karen Simonyan and Yiming Yang , booktitle=. 2019 , url=

  23. [23]

    , title =

    Snoek, Jasper and Larochelle, Hugo and Adams, Ryan P. , title =. Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 , pages =. 2012 , publisher =

  24. [24]

    Proceedings of the 19th International Conference on Artificial Intelligence and Statistics , pages =

    Non-stochastic Best Arm Identification and Hyperparameter Optimization , author =. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics , pages =. 2016 , editor =

  25. [25]

    2024 , eprint=

    ChatDev: Communicative Agents for Software Development , author=. 2024 , eprint=

  26. [26]

    Sirui Hong and Mingchen Zhuge and Jonathan Chen and Xiawu Zheng and Yuheng Cheng and Jinlin Wang and Ceyao Zhang and Zili Wang and Steven Ka Shing Yau and Zijuan Lin and Liyang Zhou and Chenyu Ran and Lingfeng Xiao and Chenglin Wu and J. Meta. The Twelfth International Conference on Learning Representations , year=

  27. [27]

    2011 , publisher=

    Kahneman, Daniel , title=. 2011 , publisher=

  28. [28]

    2026 , eprint=

    Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC , author=. 2026 , eprint=

  29. [29]

    2026 , eprint=

    KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization , author=. 2026 , eprint=

  30. [30]

    2025 , eprint=

    Astra: A Multi-Agent System for GPU Kernel Performance Optimization , author=. 2025 , eprint=

  31. [31]

    2026 , eprint=

    AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization , author=. 2026 , eprint=

  32. [32]

    2025 , eprint=

    Autonomous Code Evolution Meets NP-Completeness , author=. 2025 , eprint=

  33. [33]

    2026 , url=

    Juncheng Dong and Yang Yang and Tao Liu and Yang Wang and Feng Qi and Vahid Tarokh and Kaushik Rangadurai and Shuang Yang , booktitle=. 2026 , url=

  34. [34]

    2026 , howpublished =

  35. [35]

    Bandit Based Monte-Carlo Planning

    Kocsis, Levente and Szepesv \'a ri, Csaba. Bandit Based Monte-Carlo Planning. Machine Learning: ECML 2006. 2006

  36. [36]

    and Powley, Edward and Whitehouse, Daniel and Lucas, Simon M

    Browne, Cameron B. and Powley, Edward and Whitehouse, Daniel and Lucas, Simon M. and Cowling, Peter I. and Rohlfshagen, Philipp and Tavener, Stephen and Perez, Diego and Samothrakis, Spyridon and Colton, Simon , journal=. A Survey of Monte Carlo Tree Search Methods , year=