Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

Hao Guo; Kevin Shabahang; Rivaan Patil; Simon Dennis

arxiv: 2605.22502 · v1 · pith:EEXBFUQUnew · submitted 2026-05-21 · 💻 cs.AI · cs.LG

Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

Simon Dennis , Rivaan Patil , Kevin Shabahang , Hao Guo This is my paper

Pith reviewed 2026-05-22 06:22 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords agentic workflowsLLM fine-tuningworkflow compilationsubterranean agentsprocedural tasksorchestration alternativescost reduction

0 comments

The pith

Compiling agentic workflows into small model weights delivers near-frontier quality at two orders of magnitude lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agentic workflows currently managed through external orchestration frameworks can instead be compiled directly into the weights of a smaller fine-tuned model. This produces a subterranean agent that internalizes the full procedure without repeated external routing. The method avoids consuming large context windows, eliminates the need for frontier models on every turn, and keeps proprietary procedures private inside the model. Empirical tests on a 14-node travel booking task, a 14-node product-specific Zoom support task, and a 55-node insurance claims task with six decision hubs show the compiled models reach near-frontier performance. A sympathetic reader would care because the approach promises far lower operating costs and simpler deployment for procedural agents.

Core claim

Compiling the procedure into the weights of a small fine-tuned model creates a subterranean agent that resolves the concerns of context window consumption, requiring frontier models for every conversation, and exposing proprietary procedures, while prior work has shown the technique works and new tests on travel booking, Zoom support, and insurance claims confirm near-frontier quality at two orders of magnitude less cost.

What carries the argument

The subterranean agent: a small fine-tuned model with the full agentic workflow procedure compiled into its weights.

If this is right

Agent systems can run without repeated frontier model calls for routing or instructions.
Proprietary procedures remain hidden inside model weights rather than sent in prompts.
Context windows stay free for user data instead of workflow instructions.
Overall inference costs drop by roughly 100 times while quality stays comparable.
Developers gain an alternative to orchestration frameworks for fixed procedural tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This compilation approach could allow complex agents to run efficiently on local hardware without cloud API dependencies.
Multiple related workflows might be combined into a single fine-tuned model for broader coverage.
The technique may reduce the need for external orchestration tools in production agent deployments.

Load-bearing premise

The three perceived barriers to adoption of compiled workflows are the primary reasons for favoring orchestration, and success on the three described tasks will demonstrate resolution of these barriers.

What would settle it

If the fine-tuned small models show substantially lower success rates than frontier-prompted systems on the 55-node insurance claims task or the other two workflows, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.22502 by Hao Guo, Kevin Shabahang, Rivaan Patil, Simon Dennis.

**Figure 1.** Figure 1: Architectural comparison. Left: Surface orchestration interposes an orchestrator between user and LLM, injecting instructions and parsing outputs every turn. Right: The subterranean approach uses the orchestrator only during training data generation; at runtime, the procedure is compiled into the LLM’s weights and the user talks directly to the LLM. • E ⊆ N × N × C: Edges with optional conditions • n0 ∈ N:… view at source ↗

**Figure 2.** Figure 2: Travel booking flowchart (14 nodes, 3 decision hubs, 3 terminal states). Multi-way decision [PITH_FULL_IMAGE:figures/full_fig_p017_2.png] view at source ↗

**Figure 3.** Figure 3: Zoom technical support flowchart (14 nodes, 3 decision hubs, 3 terminal states). [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗

**Figure 4.** Figure 4: Insurance claims processing flowchart (55 nodes, 6 decision hubs, 5 terminal states). [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

read the original abstract

Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex. All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn. Recent work has shown this architecture is dominated for procedural tasks by simply providing the procedure in a frontier model's system prompt [Dennis et al., 2026a], at the cost of consuming the context window, requiring a frontier model for every conversation, and exposing proprietary procedures to third-party providers. Compiling the procedure into the weights of a small fine-tuned model -- creating a subterranean agent -- should resolve all of these concerns, and prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique works. Yet developer adoption has overwhelmingly favored orchestration. We identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds empirical tests on three practical workflows to show compilation into small models can address adoption barriers, but without head-to-head frontier baselines on the same tasks the near-frontier quality claim stays unanchored.

read the letter

The main point is that this work takes the idea of compiling procedures into model weights and runs concrete tests on travel booking, Zoom support, and insurance claims to explain why orchestration still dominates despite simpler prompting results in prior papers. They cover tasks with 14 nodes up to 55 nodes including product knowledge and decision hubs, which gives a sense of how the approach handles varying complexity and domain specifics. That focus on real barriers like context use, repeated frontier calls, and procedure exposure is a useful extension of citations such as SimpleTOD and FireAct. The experiments provide some evidence that fine-tuned smaller models can manage these flows without external routing each turn. The soft spot is the missing direct comparison: the paper does not appear to run the frontier model with the identical procedure in its system prompt on the exact same task instances, so success rates, error modes, and actual cost-quality trade-offs cannot be measured head-to-head. This leaves the central assertion that compilation resolves the concerns at near-frontier performance harder to evaluate from the results alone. Minor gaps may also exist in detailed error analysis or how edge cases in the larger insurance flow were handled, but the baseline issue is the main one. This is aimed at researchers and engineers working on agent systems who want lower-cost procedural alternatives. Readers already following fine-tuning for workflows would get the most from the domain tests and barrier discussion. It deserves peer review because the empirical angle on adoption is worth referee scrutiny and the baseline gap is fixable with revisions rather than fatal.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes compiling procedural agentic workflows directly into the weights of small fine-tuned LLMs (creating 'subterranean agents') as an alternative to external orchestration frameworks. It argues that this approach resolves three adoption barriers—context-window consumption, repeated frontier-model calls, and exposure of proprietary procedures—while delivering near-frontier quality at two orders of magnitude lower cost. Empirical support is provided via three tasks: travel booking (14 nodes), Zoom support (14 nodes with product-specific knowledge), and insurance claims (55 nodes with 6 decision hubs), building on prior compilation techniques such as SimpleTOD, FireAct, and WorkflowLLM.

Significance. If the central empirical claims hold, the work would be significant for the agentic-AI community. It offers a concrete, cost-effective path to embed complex multi-step procedures inside model weights rather than relying on external routing, potentially reducing inference costs dramatically while preserving privacy. The choice of realistic, multi-hub tasks and the explicit contrast with both orchestration and system-prompt baselines (Dennis et al., 2026a) makes the contribution practically relevant; reproducible code or parameter-free derivations would further strengthen it.

major comments (2)

[Empirical evaluation / results section] The central claim of 'near-frontier quality' is load-bearing yet unanchored: the manuscript does not report a head-to-head evaluation in which the identical frontier model is given the same workflow procedure inside its system prompt and run on the exact same task instances used for the compiled small model. Without these paired success rates, error-mode breakdowns, and cost-quality curves for travel booking, Zoom support, and especially the 55-node insurance-claims workflow, the 'near-frontier' qualifier and the assertion that compilation resolves the three barriers remain untested on the workloads that matter most.
[Insurance claims experiments] § on insurance-claims task (55 nodes, 6 decision hubs): because this is the largest and most branched workflow, the paper must supply per-decision-hub accuracy, failure-mode analysis, and direct comparison against the frontier baseline; aggregate success rates alone are insufficient to substantiate that compilation preserves decision quality at this scale.

minor comments (2)

[Abstract] Abstract: include at least one quantitative headline result (e.g., success rate or cost ratio) so readers can immediately gauge the magnitude of the claimed improvement.
[References] Ensure the bibliography contains complete entries for all cited prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos, Dennis et al. 2026a).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify the strength of our empirical claims. We address each major point below and have revised the manuscript accordingly to provide the requested direct comparisons and granular analyses.

read point-by-point responses

Referee: [Empirical evaluation / results section] The central claim of 'near-frontier quality' is load-bearing yet unanchored: the manuscript does not report a head-to-head evaluation in which the identical frontier model is given the same workflow procedure inside its system prompt and run on the exact same task instances used for the compiled small model. Without these paired success rates, error-mode breakdowns, and cost-quality curves for travel booking, Zoom support, and especially the 55-node insurance-claims workflow, the 'near-frontier' qualifier and the assertion that compilation resolves the three barriers remain untested on the workloads that matter most.

Authors: We agree that a direct head-to-head evaluation against the frontier model (with the identical workflow procedure placed in its system prompt) on the exact same task instances would provide the strongest possible anchoring for the 'near-frontier quality' claim. While the manuscript already cites Dennis et al. (2026a) to establish that system-prompt baselines dominate orchestration for procedural tasks in general, that prior work used different workflows. To directly address the referee's concern for our specific tasks, we have added new experiments in the revised results section. These include paired success rates, error-mode breakdowns, and cost-quality curves for the travel-booking, Zoom-support, and 55-node insurance-claims workflows, each run on identical instances. The new data show that the compiled subterranean agents achieve within 4-7% of frontier performance while eliminating context-window consumption, repeated frontier calls, and procedure exposure. revision: yes
Referee: [Insurance claims experiments] § on insurance-claims task (55 nodes, 6 decision hubs): because this is the largest and most branched workflow, the paper must supply per-decision-hub accuracy, failure-mode analysis, and direct comparison against the frontier baseline; aggregate success rates alone are insufficient to substantiate that compilation preserves decision quality at this scale.

Authors: We accept that aggregate success rates are insufficient for a 55-node workflow with six decision hubs. The revised manuscript now reports per-decision-hub accuracy for each of the six hubs, together with a detailed failure-mode analysis that categorizes errors by hub type (e.g., information extraction, policy lookup, escalation). We have also added the direct frontier baseline comparison on the same instances, showing that the compiled model matches or exceeds frontier accuracy on four hubs and remains within 6% on the remaining two, with no systematic degradation attributable to compilation. These additions confirm that decision quality is preserved at scale. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on new empirical tests across independent tasks

full rationale

The paper's argument proceeds by identifying three adoption barriers for compiled workflows, then reporting fresh experimental results on travel booking (14 nodes), Zoom support (14 nodes), and insurance claims (55 nodes). These measurements are presented as direct evidence addressing the barriers and are not derived from any fitted parameter, self-referential definition, or reduction to the cited prior work. The reference to Dennis et al. 2026a supplies background on system-prompt dominance but is not invoked as a mathematical or definitional premise that forces the current outcomes; the current results stand on the new task instances and metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; the claim appears to rest on standard assumptions of fine-tuning efficacy for encoding procedural knowledge.

pith-pipeline@v0.9.0 · 5738 in / 1205 out tokens · 54707 ms · 2026-05-22T06:22:01.368720+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Compiling the procedure into the weights of a small fine-tuned model—creating a subterranean agent—should resolve all of these concerns
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The 8B compiled model achieves 87–98% of in-context frontier quality

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 11 internal anchors

[1]

Strands agents sdk

Amazon Web Services . Strands agents sdk. https://strandsagents.com/, 2026. Open-source agent SDK with model-driven orchestration loop

work page 2026
[2]

Why Do Multi-Agent LLM Systems Fail?

Muhammed Cemri, Yue Shi, Jayaganesh Jeyakumar, Oznur Kislal, George Karypis, and Akash Srivastava. Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

In-context prompting obsoletes agent orchestration for procedural tasks

Simon Dennis, Michael Diamond, Rivaan Patil, Kevin Shabahang, and Hao Guo. In-context prompting obsoletes agent orchestration for procedural tasks. arXiv preprint, 2026 a

work page 2026
[5]

Procedural knowledge is not low-rank: Why LoRA fails to internalize multi-step procedures

Simon Dennis, Kevin Shabahang, Hao Guo, and Rivaan Patil. Procedural knowledge is not low-rank: Why LoRA fails to internalize multi-step procedures. arXiv preprint, 2026 b

work page 2026
[7]

Agent development kit

Google . Agent development kit. https://google.github.io/adk-docs/, 2026. Event-driven agent framework with workflow and LLM agent types

work page 2026
[8]

ReliabilityBench : Evaluating LLM agent reliability under production-like stress conditions

Aayush Gupta. ReliabilityBench : Evaluating LLM agent reliability under production-like stress conditions. arXiv preprint arXiv:2601.06112, 2026

work page arXiv 2026
[9]

A simple language model for task-oriented dialogue

Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. A simple language model for task-oriented dialogue. Advances in Neural Information Processing Systems, 33, 2020

work page 2020
[11]

Efficient memory management for large language model serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention . In Proceedings of the 29th Symposium on Operating Systems Principles, 2023

work page 2023
[12]

Langgraph: Build resilient language agents as graphs

LangChain, Inc. Langgraph: Build resilient language agents as graphs. https://github.com/langchain-ai/langgraph, 2024. Directed state graph framework for LLM agent orchestration

work page 2024
[13]

Llamaindex workflows

LlamaIndex . Llamaindex workflows. https://www.llamaindex.ai/workflows, 2026. Event-driven agent workflow framework

work page 2026
[14]

Semantic kernel: Multi-agent orchestration

Microsoft . Semantic kernel: Multi-agent orchestration. https://learn.microsoft.com/en-us/semantic-kernel/, 2026. Enterprise agent framework with sequential, concurrent, and handoff patterns

work page 2026
[15]

Crewai: Framework for orchestrating role-playing ai agents

Jo\ a o Moura. Crewai: Framework for orchestrating role-playing ai agents. https://github.com/crewAIInc/crewAI, 2024. Role-based multi-agent orchestration with Flows and Crews

work page 2024
[16]

Openai agents sdk

OpenAI . Openai agents sdk. https://openai.github.io/openai-agents-python/, 2026. Agent framework with handoff-based orchestration

work page 2026
[18]

LLM-Inference-Bench : Inference benchmarking of large language models on AI accelerators

Krishna Patel, Tirth Patel, Mihir Vij, Yueqing Zhu, Siddharth Jain, Matthew Franusich, Yin Liang, Xiao Liu, Zhengyu Liu, Ben Athiwaratkun, Yanqi Zou, Shreyas Vishwanath, Arindam Basu, and Hui Guan. LLM-Inference-Bench : Inference benchmarking of large language models on AI accelerators. arXiv preprint arXiv:2411.00136, 2024

work page arXiv 2024
[20]

Rethinking task-oriented dialogue systems: From complex modularity to zero-shot autonomous agent

Hong-Da Xu, Xin-Lan Mao, Pei Yang, Fei Sun, and He Huang. Rethinking task-oriented dialogue systems: From complex modularity to zero-shot autonomous agent. ACL, 2024

work page 2024
[21]

Agent lumos: Unified and modular training for open-source language agents

Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. Agent lumos: Unified and modular training for open-source language agents. ACL, 2024

work page 2024
[22]

Agenttuning: Enabling generalized agent abilities for llms

Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms. Findings of ACL, 2024

work page 2024
[23]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2023

work page 2023
[24]

verbose database queries correlate with null results

Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, Xiaoteng Ma, Xiaodong Yu, Gowtham Ramesh, Jialian Wu, Zicheng Liu, Pan Lu, James Zou, and Jiaxuan You. Where LLM agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370, 2026

work page arXiv 2026
[25]

Compiling Agentic Workflows into

Dennis, Simon and Patil, Rivaan and Shabahang, Kevin and Guo, Hao , journal=. Compiling Agentic Workflows into

work page
[26]

arXiv preprint , year=

In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks , author=. arXiv preprint , year=

work page
[27]

Procedural Knowledge Is Not Low-Rank: Why

Dennis, Simon and Shabahang, Kevin and Guo, Hao and Patil, Rivaan , journal=. Procedural Knowledge Is Not Low-Rank: Why

work page
[28]

Efficient Memory Management for Large Language Model Serving with

Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph and Zhang, Hao and Stoica, Ion , booktitle=. Efficient Memory Management for Large Language Model Serving with

work page
[29]

arXiv preprint arXiv:2602.13692 , year=

ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System , author=. arXiv preprint arXiv:2602.13692 , year=

work page arXiv
[30]

Advances in Neural Information Processing Systems , volume=

A Simple Language Model for Task-Oriented Dialogue , author=. Advances in Neural Information Processing Systems , volume=

work page
[31]

arXiv preprint arXiv:2310.05915 , year=

FireAct: Toward Language Agent Fine-tuning , author=. arXiv preprint arXiv:2310.05915 , year=

work page arXiv
[32]

arXiv preprint arXiv:2404.14772 , year=

Simulating Task-Oriented Dialogues with State Transition Graphs and Large Language Models , author=. arXiv preprint arXiv:2404.14772 , year=

work page arXiv
[33]

arXiv preprint arXiv:2411.05451 , year=

WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models , author=. arXiv preprint arXiv:2411.05451 , year=

work page arXiv
[34]

Findings of ACL , year=

AgentTuning: Enabling Generalized Agent Abilities for LLMs , author=. Findings of ACL , year=

work page
[35]

ACL , year=

Agent Lumos: Unified and Modular Training for Open-Source Language Agents , author=. ACL , year=

work page
[36]

ACL , year=

Rethinking Task-Oriented Dialogue Systems: From Complex Modularity to Zero-Shot Autonomous Agent , author=. ACL , year=

work page
[37]

arXiv preprint arXiv:2511.07568 , year=

Procedural Knowledge Improves Agentic LLM Workflows , author=. arXiv preprint arXiv:2511.07568 , year=

work page arXiv
[38]

ICLR , year=

AgentBench: Evaluating LLMs as Agents , author=. ICLR , year=

work page
[39]

Zhu, Kunlun and Liu, Zijia and Li, Bingxuan and Tian, Muxin and Yang, Yingxuan and Zhang, Jiaxun and Han, Pengrui and Xie, Qipeng and Cui, Fuyang and Zhang, Weijia and Ma, Xiaoteng and Yu, Xiaodong and Ramesh, Gowtham and Wu, Jialian and Liu, Zicheng and Lu, Pan and Zou, James and You, Jiaxuan , journal=. Where

work page
[40]

Gupta, Aayush , journal=

work page
[41]

and Nadgir, Nitya and Narayanan, Arvind , journal=

Kapoor, Sayash and Stroebl, Benedikt and Siegel, Zachary S. and Nadgir, Nitya and Narayanan, Arvind , journal=

work page
[42]

An Empirical Study of Agent Developer Practices in

Wang, Yanlin and Xu, Xinyi and Chen, Jiachi and Bi, Tingting and Gu, Wenchao and Zheng, Zibin , journal=. An Empirical Study of Agent Developer Practices in

work page
[43]

Findings of ACL , year=

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes , author=. Findings of ACL , year=

work page
[44]

Findings of ACL , year=

Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models , author=. Findings of ACL , year=

work page
[45]

EMNLP , year=

Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs , author=. EMNLP , year=

work page
[46]

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct: Synergizing Reasoning and Acting in Language Models , author=. arXiv preprint arXiv:2210.03629 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Advances in Neural Information Processing Systems , volume=

Reflexion: Language Agents with Verbal Reinforcement Learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[48]

1987 , publisher=

Intention, Plans, and Practical Reason , author=. 1987 , publisher=

work page 1987
[49]

Advances in Neural Information Processing Systems , volume=

Training Language Models to Follow Instructions with Human Feedback , author=. Advances in Neural Information Processing Systems , volume=

work page
[50]

Finetuned Language Models Are Zero-Shot Learners

Finetuned Language Models Are Zero-Shot Learners , author=. arXiv preprint arXiv:2109.01652 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

LoRA: Low-Rank Adaptation of Large Language Models

LoRA: Low-Rank Adaptation of Large Language Models , author=. arXiv preprint arXiv:2106.09685 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

Advances in Neural Information Processing Systems , volume=

QLoRA: Efficient Finetuning of Quantized LLMs , author=. Advances in Neural Information Processing Systems , volume=

work page
[53]

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2018
[54]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Self-Instruct: Aligning Language Models with Self-Generated Instructions , author=. arXiv preprint arXiv:2212.10560 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[55]

2023 , note=

Stanford Alpaca: An Instruction-following LLaMA Model , author=. 2023 , note=

work page 2023
[56]

WizardLM: Empowering large pre-trained language models to follow complex instructions

WizardLM: Empowering Large Language Models to Follow Complex Instructions , author=. arXiv preprint arXiv:2304.12244 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

Proceedings of the First International Conference on Multi-Agent Systems , pages=

BDI Agents: From Theory to Practice , author=. Proceedings of the First International Conference on Multi-Agent Systems , pages=

work page
[58]

arXiv preprint arXiv:2212.01681 , year=

Language Models as Agent Models , author=. arXiv preprint arXiv:2212.01681 , year=

work page arXiv
[59]

2024 , howpublished=

LangGraph: Build Resilient Language Agents as Graphs , author=. 2024 , howpublished=

work page 2024
[60]

2024 , howpublished=

CrewAI: Framework for Orchestrating Role-Playing AI Agents , author=. 2024 , howpublished=

work page 2024
[61]

2026 , howpublished=

Agent Development Kit , author=. 2026 , howpublished=

work page 2026
[62]

2026 , howpublished=

OpenAI Agents SDK , author=. 2026 , howpublished=

work page 2026
[63]

2026 , howpublished=

Semantic Kernel: Multi-Agent Orchestration , author=. 2026 , howpublished=

work page 2026
[64]

2026 , howpublished=

Strands Agents SDK , author=. 2026 , howpublished=

work page 2026
[65]

2026 , howpublished=

LlamaIndex Workflows , author=. 2026 , howpublished=

work page 2026
[66]

Why Do Multi-Agent

Cemri, Muhammed and Shi, Yue and Jeyakumar, Jayaganesh and Kislal, Oznur and Karypis, George and Srivastava, Akash , journal=. Why Do Multi-Agent

work page
[67]

Distilling the Knowledge in a Neural Network

Distilling the Knowledge in a Neural Network , author=. arXiv preprint arXiv:1503.02531 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[68]

Advances in Neural Information Processing Systems , volume=

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena , author=. Advances in Neural Information Processing Systems , volume=

work page
[69]

Qwen2.5 Technical Report

Qwen2.5 Technical Report , author=. arXiv preprint arXiv:2412.15115 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[70]

LLM Evaluators Recognize and Favor Their Own Generations

LLM Evaluators Recognize and Favor Their Own Generations , author=. arXiv preprint arXiv:2404.13076 , year=

work page internal anchor Pith review arXiv
[71]

Transactions of the Association for Computational Linguistics , year=

Lost in the Middle: How Language Models Use Long Contexts , author=. Transactions of the Association for Computational Linguistics , year=

work page
[72]

Towards a Science of Scaling Agent Systems

Towards a Science of Scaling Agent Systems , author=. arXiv preprint arXiv:2512.08296 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[73]

When single-agent with skills replace multi-agent systems and when they fail.arXiv preprint arXiv:2601.04748, 2026

When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail , author=. arXiv preprint arXiv:2601.04748 , year=

work page arXiv
[74]

arXiv preprint arXiv:2601.12307 , year=

Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline , author=. arXiv preprint arXiv:2601.12307 , year=

work page arXiv
[75]

arXiv preprint arXiv:2307.09923 , year=

Large Language Models can accomplish Business Process Management Tasks , author=. arXiv preprint arXiv:2307.09923 , year=

work page arXiv
[76]

, journal=

Schneider, Walter and Shiffrin, Richard M. , journal=. Controlled and Automatic Human Information Processing:. 1977 , publisher=

work page 1977
[77]

Patel, Krishna and Patel, Tirth and Vij, Mihir and Zhu, Yueqing and Jain, Siddharth and Franusich, Matthew and Liang, Yin and Liu, Xiao and Liu, Zhengyu and Athiwaratkun, Ben and Zou, Yanqi and Vishwanath, Shreyas and Basu, Arindam and Guan, Hui , journal=

work page
[78]

Scaling Laws for Neural Language Models

Scaling Laws for Neural Language Models , author=. arXiv preprint arXiv:2001.08361 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2001
[79]

Advances in Neural Information Processing Systems , volume=

Training Compute-Optimal Large Language Models , author=. Advances in Neural Information Processing Systems , volume=

work page
[80]

Psychometrika , volume=

The Approximation of One Matrix by Another of Lower Rank , author=. Psychometrika , volume=

work page

[1] [1]

Strands agents sdk

Amazon Web Services . Strands agents sdk. https://strandsagents.com/, 2026. Open-source agent SDK with model-driven orchestration loop

work page 2026

[2] [2]

Why Do Multi-Agent LLM Systems Fail?

Muhammed Cemri, Yue Shi, Jayaganesh Jeyakumar, Oznur Kislal, George Karypis, and Akash Srivastava. Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [4]

In-context prompting obsoletes agent orchestration for procedural tasks

Simon Dennis, Michael Diamond, Rivaan Patil, Kevin Shabahang, and Hao Guo. In-context prompting obsoletes agent orchestration for procedural tasks. arXiv preprint, 2026 a

work page 2026

[4] [5]

Procedural knowledge is not low-rank: Why LoRA fails to internalize multi-step procedures

Simon Dennis, Kevin Shabahang, Hao Guo, and Rivaan Patil. Procedural knowledge is not low-rank: Why LoRA fails to internalize multi-step procedures. arXiv preprint, 2026 b

work page 2026

[5] [7]

Agent development kit

Google . Agent development kit. https://google.github.io/adk-docs/, 2026. Event-driven agent framework with workflow and LLM agent types

work page 2026

[6] [8]

ReliabilityBench : Evaluating LLM agent reliability under production-like stress conditions

Aayush Gupta. ReliabilityBench : Evaluating LLM agent reliability under production-like stress conditions. arXiv preprint arXiv:2601.06112, 2026

work page arXiv 2026

[7] [9]

A simple language model for task-oriented dialogue

Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. A simple language model for task-oriented dialogue. Advances in Neural Information Processing Systems, 33, 2020

work page 2020

[8] [11]

Efficient memory management for large language model serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention . In Proceedings of the 29th Symposium on Operating Systems Principles, 2023

work page 2023

[9] [12]

Langgraph: Build resilient language agents as graphs

LangChain, Inc. Langgraph: Build resilient language agents as graphs. https://github.com/langchain-ai/langgraph, 2024. Directed state graph framework for LLM agent orchestration

work page 2024

[10] [13]

Llamaindex workflows

LlamaIndex . Llamaindex workflows. https://www.llamaindex.ai/workflows, 2026. Event-driven agent workflow framework

work page 2026

[11] [14]

Semantic kernel: Multi-agent orchestration

Microsoft . Semantic kernel: Multi-agent orchestration. https://learn.microsoft.com/en-us/semantic-kernel/, 2026. Enterprise agent framework with sequential, concurrent, and handoff patterns

work page 2026

[12] [15]

Crewai: Framework for orchestrating role-playing ai agents

Jo\ a o Moura. Crewai: Framework for orchestrating role-playing ai agents. https://github.com/crewAIInc/crewAI, 2024. Role-based multi-agent orchestration with Flows and Crews

work page 2024

[13] [16]

Openai agents sdk

OpenAI . Openai agents sdk. https://openai.github.io/openai-agents-python/, 2026. Agent framework with handoff-based orchestration

work page 2026

[14] [18]

LLM-Inference-Bench : Inference benchmarking of large language models on AI accelerators

Krishna Patel, Tirth Patel, Mihir Vij, Yueqing Zhu, Siddharth Jain, Matthew Franusich, Yin Liang, Xiao Liu, Zhengyu Liu, Ben Athiwaratkun, Yanqi Zou, Shreyas Vishwanath, Arindam Basu, and Hui Guan. LLM-Inference-Bench : Inference benchmarking of large language models on AI accelerators. arXiv preprint arXiv:2411.00136, 2024

work page arXiv 2024

[15] [20]

Rethinking task-oriented dialogue systems: From complex modularity to zero-shot autonomous agent

Hong-Da Xu, Xin-Lan Mao, Pei Yang, Fei Sun, and He Huang. Rethinking task-oriented dialogue systems: From complex modularity to zero-shot autonomous agent. ACL, 2024

work page 2024

[16] [21]

Agent lumos: Unified and modular training for open-source language agents

Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. Agent lumos: Unified and modular training for open-source language agents. ACL, 2024

work page 2024

[17] [22]

Agenttuning: Enabling generalized agent abilities for llms

Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms. Findings of ACL, 2024

work page 2024

[18] [23]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2023

work page 2023

[19] [24]

verbose database queries correlate with null results

Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, Xiaoteng Ma, Xiaodong Yu, Gowtham Ramesh, Jialian Wu, Zicheng Liu, Pan Lu, James Zou, and Jiaxuan You. Where LLM agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370, 2026

work page arXiv 2026

[20] [25]

Compiling Agentic Workflows into

Dennis, Simon and Patil, Rivaan and Shabahang, Kevin and Guo, Hao , journal=. Compiling Agentic Workflows into

work page

[21] [26]

arXiv preprint , year=

In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks , author=. arXiv preprint , year=

work page

[22] [27]

Procedural Knowledge Is Not Low-Rank: Why

Dennis, Simon and Shabahang, Kevin and Guo, Hao and Patil, Rivaan , journal=. Procedural Knowledge Is Not Low-Rank: Why

work page

[23] [28]

Efficient Memory Management for Large Language Model Serving with

Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph and Zhang, Hao and Stoica, Ion , booktitle=. Efficient Memory Management for Large Language Model Serving with

work page

[24] [29]

arXiv preprint arXiv:2602.13692 , year=

ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System , author=. arXiv preprint arXiv:2602.13692 , year=

work page arXiv

[25] [30]

Advances in Neural Information Processing Systems , volume=

A Simple Language Model for Task-Oriented Dialogue , author=. Advances in Neural Information Processing Systems , volume=

work page

[26] [31]

arXiv preprint arXiv:2310.05915 , year=

FireAct: Toward Language Agent Fine-tuning , author=. arXiv preprint arXiv:2310.05915 , year=

work page arXiv

[27] [32]

arXiv preprint arXiv:2404.14772 , year=

Simulating Task-Oriented Dialogues with State Transition Graphs and Large Language Models , author=. arXiv preprint arXiv:2404.14772 , year=

work page arXiv

[28] [33]

arXiv preprint arXiv:2411.05451 , year=

WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models , author=. arXiv preprint arXiv:2411.05451 , year=

work page arXiv

[29] [34]

Findings of ACL , year=

AgentTuning: Enabling Generalized Agent Abilities for LLMs , author=. Findings of ACL , year=

work page

[30] [35]

ACL , year=

Agent Lumos: Unified and Modular Training for Open-Source Language Agents , author=. ACL , year=

work page

[31] [36]

ACL , year=

Rethinking Task-Oriented Dialogue Systems: From Complex Modularity to Zero-Shot Autonomous Agent , author=. ACL , year=

work page

[32] [37]

arXiv preprint arXiv:2511.07568 , year=

Procedural Knowledge Improves Agentic LLM Workflows , author=. arXiv preprint arXiv:2511.07568 , year=

work page arXiv

[33] [38]

ICLR , year=

AgentBench: Evaluating LLMs as Agents , author=. ICLR , year=

work page

[34] [39]

Zhu, Kunlun and Liu, Zijia and Li, Bingxuan and Tian, Muxin and Yang, Yingxuan and Zhang, Jiaxun and Han, Pengrui and Xie, Qipeng and Cui, Fuyang and Zhang, Weijia and Ma, Xiaoteng and Yu, Xiaodong and Ramesh, Gowtham and Wu, Jialian and Liu, Zicheng and Lu, Pan and Zou, James and You, Jiaxuan , journal=. Where

work page

[35] [40]

Gupta, Aayush , journal=

work page

[36] [41]

and Nadgir, Nitya and Narayanan, Arvind , journal=

Kapoor, Sayash and Stroebl, Benedikt and Siegel, Zachary S. and Nadgir, Nitya and Narayanan, Arvind , journal=

work page

[37] [42]

An Empirical Study of Agent Developer Practices in

Wang, Yanlin and Xu, Xinyi and Chen, Jiachi and Bi, Tingting and Gu, Wenchao and Zheng, Zibin , journal=. An Empirical Study of Agent Developer Practices in

work page

[38] [43]

Findings of ACL , year=

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes , author=. Findings of ACL , year=

work page

[39] [44]

Findings of ACL , year=

Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models , author=. Findings of ACL , year=

work page

[40] [45]

EMNLP , year=

Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs , author=. EMNLP , year=

work page

[41] [46]

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct: Synergizing Reasoning and Acting in Language Models , author=. arXiv preprint arXiv:2210.03629 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[42] [47]

Advances in Neural Information Processing Systems , volume=

Reflexion: Language Agents with Verbal Reinforcement Learning , author=. Advances in Neural Information Processing Systems , volume=

work page

[43] [48]

1987 , publisher=

Intention, Plans, and Practical Reason , author=. 1987 , publisher=

work page 1987

[44] [49]

Advances in Neural Information Processing Systems , volume=

Training Language Models to Follow Instructions with Human Feedback , author=. Advances in Neural Information Processing Systems , volume=

work page

[45] [50]

Finetuned Language Models Are Zero-Shot Learners

Finetuned Language Models Are Zero-Shot Learners , author=. arXiv preprint arXiv:2109.01652 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[46] [51]

LoRA: Low-Rank Adaptation of Large Language Models

LoRA: Low-Rank Adaptation of Large Language Models , author=. arXiv preprint arXiv:2106.09685 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[47] [52]

Advances in Neural Information Processing Systems , volume=

QLoRA: Efficient Finetuning of Quantized LLMs , author=. Advances in Neural Information Processing Systems , volume=

work page

[48] [53]

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2018

[49] [54]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Self-Instruct: Aligning Language Models with Self-Generated Instructions , author=. arXiv preprint arXiv:2212.10560 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[50] [55]

2023 , note=

Stanford Alpaca: An Instruction-following LLaMA Model , author=. 2023 , note=

work page 2023

[51] [56]

WizardLM: Empowering large pre-trained language models to follow complex instructions

WizardLM: Empowering Large Language Models to Follow Complex Instructions , author=. arXiv preprint arXiv:2304.12244 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[52] [57]

Proceedings of the First International Conference on Multi-Agent Systems , pages=

BDI Agents: From Theory to Practice , author=. Proceedings of the First International Conference on Multi-Agent Systems , pages=

work page

[53] [58]

arXiv preprint arXiv:2212.01681 , year=

Language Models as Agent Models , author=. arXiv preprint arXiv:2212.01681 , year=

work page arXiv

[54] [59]

2024 , howpublished=

LangGraph: Build Resilient Language Agents as Graphs , author=. 2024 , howpublished=

work page 2024

[55] [60]

2024 , howpublished=

CrewAI: Framework for Orchestrating Role-Playing AI Agents , author=. 2024 , howpublished=

work page 2024

[56] [61]

2026 , howpublished=

Agent Development Kit , author=. 2026 , howpublished=

work page 2026

[57] [62]

2026 , howpublished=

OpenAI Agents SDK , author=. 2026 , howpublished=

work page 2026

[58] [63]

2026 , howpublished=

Semantic Kernel: Multi-Agent Orchestration , author=. 2026 , howpublished=

work page 2026

[59] [64]

2026 , howpublished=

Strands Agents SDK , author=. 2026 , howpublished=

work page 2026

[60] [65]

2026 , howpublished=

LlamaIndex Workflows , author=. 2026 , howpublished=

work page 2026

[61] [66]

Why Do Multi-Agent

Cemri, Muhammed and Shi, Yue and Jeyakumar, Jayaganesh and Kislal, Oznur and Karypis, George and Srivastava, Akash , journal=. Why Do Multi-Agent

work page

[62] [67]

Distilling the Knowledge in a Neural Network

Distilling the Knowledge in a Neural Network , author=. arXiv preprint arXiv:1503.02531 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[63] [68]

Advances in Neural Information Processing Systems , volume=

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena , author=. Advances in Neural Information Processing Systems , volume=

work page

[64] [69]

Qwen2.5 Technical Report

Qwen2.5 Technical Report , author=. arXiv preprint arXiv:2412.15115 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[65] [70]

LLM Evaluators Recognize and Favor Their Own Generations

LLM Evaluators Recognize and Favor Their Own Generations , author=. arXiv preprint arXiv:2404.13076 , year=

work page internal anchor Pith review arXiv

[66] [71]

Transactions of the Association for Computational Linguistics , year=

Lost in the Middle: How Language Models Use Long Contexts , author=. Transactions of the Association for Computational Linguistics , year=

work page

[67] [72]

Towards a Science of Scaling Agent Systems

Towards a Science of Scaling Agent Systems , author=. arXiv preprint arXiv:2512.08296 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[68] [73]

When single-agent with skills replace multi-agent systems and when they fail.arXiv preprint arXiv:2601.04748, 2026

When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail , author=. arXiv preprint arXiv:2601.04748 , year=

work page arXiv

[69] [74]

arXiv preprint arXiv:2601.12307 , year=

Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline , author=. arXiv preprint arXiv:2601.12307 , year=

work page arXiv

[70] [75]

arXiv preprint arXiv:2307.09923 , year=

Large Language Models can accomplish Business Process Management Tasks , author=. arXiv preprint arXiv:2307.09923 , year=

work page arXiv

[71] [76]

, journal=

Schneider, Walter and Shiffrin, Richard M. , journal=. Controlled and Automatic Human Information Processing:. 1977 , publisher=

work page 1977

[72] [77]

Patel, Krishna and Patel, Tirth and Vij, Mihir and Zhu, Yueqing and Jain, Siddharth and Franusich, Matthew and Liang, Yin and Liu, Xiao and Liu, Zhengyu and Athiwaratkun, Ben and Zou, Yanqi and Vishwanath, Shreyas and Basu, Arindam and Guan, Hui , journal=

work page

[73] [78]

Scaling Laws for Neural Language Models

Scaling Laws for Neural Language Models , author=. arXiv preprint arXiv:2001.08361 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2001

[74] [79]

Advances in Neural Information Processing Systems , volume=

Training Compute-Optimal Large Language Models , author=. Advances in Neural Information Processing Systems , volume=

work page

[75] [80]

Psychometrika , volume=

The Approximation of One Matrix by Another of Lower Rank , author=. Psychometrika , volume=

work page