Stateful Inference for Low-Latency Multi-Agent Tool Calling

Victor Norgren

arxiv: 2605.26289 · v1 · pith:L45PLJRMnew · submitted 2026-05-25 · 💻 cs.LG

Pith reviewed 2026-06-29 22:36 UTC · model grok-4.3

classification 💻 cs.LG

keywords: stateful inference, KV cache reuse, multi-agent tool calling, LLM serving optimization, delta-only computation, speculative decoding, persistent prefix cache

The pith

A persistent KV cache across turns reduces multi-agent tool-calling cost from full-prompt reprocessing to only the new tokens each step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that conventional inference servers restart from the entire conversation history on every tool call even though 85 to 95 percent of the prompt is identical to the previous turn. It presents a stateful architecture that keeps the key-value cache alive, advances it only on the delta tokens, and adds a radix prefix cache plus speculative decoding to handle interleaved agents and structured outputs. If the approach holds, long agent workflows become far cheaper per additional turn because repeated computation on unchanged context is eliminated. The reference system reports 2.1 times faster per-turn performance on a 6-turn workflow and 4.2 times on the median turn of a 35-turn workflow, cutting end-to-end wall time in half.
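The delta-only idea can be illustrated with a minimal sketch. The `StatefulSession` class below is hypothetical and is not the paper's implementation: it stores token IDs in place of real key-value tensors and only counts how many tokens each turn would need to process, assuming the history is mostly append-only.

```python
from dataclasses import dataclass, field


@dataclass
class StatefulSession:
    """Toy stand-in for a persistent KV cache (hypothetical, token IDs only)."""
    cached: list[int] = field(default_factory=list)

    def advance(self, prompt: list[int]) -> int:
        """Ingest `prompt` and return how many tokens needed processing."""
        # Length of the prefix shared by the cache and the new prompt.
        shared = 0
        for old, new in zip(self.cached, prompt):
            if old != new:
                break
            shared += 1
        # Drop any stale suffix, then append only the delta tokens.
        delta = prompt[shared:]
        self.cached = self.cached[:shared] + delta
        return len(delta)


def stateless_cost(prompt: list[int]) -> int:
    """A stateless server processes every prompt token on every turn."""
    return len(prompt)


session = StatefulSession()
history = list(range(1000))                   # turn 1: 1000-token prompt
first = session.advance(history)              # cold cache: all 1000 tokens
history = history + list(range(1000, 1080))   # turn 2 appends 80 new tokens
second = session.advance(history)             # warm cache: only the 80-token delta
```

In this toy run the second turn processes 80 tokens where a stateless server would process 1080, which is the kind of gap the paper's claim relies on.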

Core claim

The central claim is that a stateful inference architecture converts the O(n_t) per-turn cost of conventional serving into an O(Δ_t) delta-only cost. A persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel fully-generated workloads the reference implementation is 2.1 times faster per turn on a 6-turn agentic workflow and 4.2 times on the median turn of a 35-turn one, with the advantage coming from stateful reuse and speculation rather than caching.
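One way to make the notation concrete, under assumptions that this summary does not state: let $n_t$ be the prompt length in tokens at turn $t$, let $\Delta_t$ be the number of tokens added since turn $t-1$, and count only the prompt tokens that must be processed. Decode cost is ignored, as is the fact that attention for each new token still reads the whole cached prefix.

```latex
\begin{align*}
  n_t &= n_{t-1} + \Delta_t, \qquad n_0 = 0 \\
  C^{\text{stateless}}_T &= \sum_{t=1}^{T} n_t \\
  C^{\text{stateful}}_T &= \sum_{t=1}^{T} \Delta_t = n_T
\end{align*}
```

Under this accounting the stateless total grows with the sum of all prompt lengths, while the stateful total equals the final prompt length, so the gap widens as the number of turns $T$ increases.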

What carries the argument

Persistent KV cache that lives across turns and advances by ingesting only the new tokens

If this is right

End-to-end wall time for multi-turn agentic workflows is halved.
The relative speedup increases with workflow length because the unchanged prefix is never recomputed.
Structured outputs incur lower overhead once the context is already cached.
Interleaved traffic from multiple agents can share prefix computations without duplication.
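The prefix sharing across interleaved agents can be sketched with a plain token trie. This is a simplification of a radix tree (no edge compression, no eviction, no real KV tensors), and the `PrefixTrie` class and its method names are hypothetical, chosen only for illustration.

```python
class PrefixTrie:
    """Uncompressed token trie; a simplified stand-in for a radix prefix cache."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, tokens: list[int]) -> None:
        """Record that state for every prefix of `tokens` is now cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def longest_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens of `tokens` are already cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixTrie()
system_prompt = [1, 2, 3, 4]                  # shared by both agents
agent_a = system_prompt + [10, 11]
agent_b = system_prompt + [20, 21, 22]

cache.insert(agent_a)                         # agent A's turn arrives first
hit_b = cache.longest_prefix(agent_b)         # agent B reuses the shared prefix
cache.insert(agent_b)
hit_a_next = cache.longest_prefix(agent_a + [12])  # A's next turn, after B ran
```

Here agent B finds the 4-token system prompt already cached, and agent A's next turn still matches all 6 of its earlier tokens even though B's request was served in between.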

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same delta-only pattern could apply to any accumulating context session such as long chats or iterative planning loops.
Production agent deployments could support deeper or longer-running workflows once marginal cost per turn drops.
Workflow designers might deliberately increase turn count if they know recomputation is avoided.

Load-bearing premise

Existing inference frameworks always treat each tool call as an independent request and re-process the full conversation from scratch.

What would settle it

Running the reference implementation head-to-head with vLLM or SGLang on the 35-turn workload and observing that median per-turn latency is not at least 4 times lower.
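A check of this kind could be scripted roughly as follows. The sketch assumes the speedup is defined as the ratio of median per-turn latencies (baseline over candidate); the summary does not say whether the paper uses this definition or a per-turn ratio. The function names and the 4.0 threshold default are illustrative.

```python
from statistics import median


def median_speedup(baseline_ms: list[float], candidate_ms: list[float]) -> float:
    """Ratio of median per-turn latencies (baseline / candidate)."""
    if not baseline_ms or not candidate_ms:
        raise ValueError("both latency lists must be non-empty")
    return median(baseline_ms) / median(candidate_ms)


def claim_survives(baseline_ms: list[float], candidate_ms: list[float],
                   threshold: float = 4.0) -> bool:
    """True if the candidate's median turn is at least `threshold` times faster."""
    return median_speedup(baseline_ms, candidate_ms) >= threshold


# Placeholder numbers for illustration only; these are NOT measurements.
toy_baseline = [100.0, 200.0, 300.0]
toy_candidate = [40.0, 50.0, 60.0]
ratio = median_speedup(toy_baseline, toy_candidate)
```

A real run would replace the toy lists with per-turn latencies recorded for each system on the same 35-turn workload and hardware.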

Figures

Figures reproduced from arXiv: 2605.26289 by Victor Norgren.

**Figure 2.** LayerScale speedup over vLLM and SGLang on median per-turn latency, with no caching.

**Figure 3.** Cumulative wall time over the 35-turn novel coding workflow (no response-cache hits).

Original abstract

Multi-agent tool calling is becoming the dominant interaction pattern for LLM-based systems, yet existing inference frameworks treat each tool call as an independent request, re-processing the entire conversation from scratch even though 85-95% of the prompt is unchanged from the previous turn. We present a stateful inference architecture that converts the $O(n_t)$ per-turn cost of conventional serving into an $O(\Delta_t)$ delta-only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel, fully-generated workloads, the reference implementation is $2.1\times$ faster per turn on a 6-turn agentic workflow and $4.2\times$ on the median turn of a 35-turn one, halving end-to-end wall time. The advantage comes from stateful reuse and speculation, not caching.
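The prompt-lookup speculative decoder mentioned in the abstract can be sketched in a few lines. This follows the general idea of prompt lookup decoding (match the trailing n-gram against earlier context and propose the tokens that followed it); the function names, the default `ngram` and `max_draft` values, and the `verified` argument standing in for the target model's own greedy tokens are all assumptions, not details from the paper.

```python
def propose_draft(context: list[int], ngram: int = 3, max_draft: int = 5) -> list[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in `context`."""
    if len(context) <= ngram:
        return []
    tail = context[-ngram:]
    # Scan right to left so the most recent earlier match wins.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == tail:
            return context[start + ngram:start + ngram + max_draft]
    return []


def accept_draft(draft: list[int], verified: list[int]) -> list[int]:
    """Keep the longest draft prefix that agrees with the target model's tokens."""
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    return accepted


context = [5, 6, 7, 8, 9, 1, 5, 6, 7]         # trailing [5, 6, 7] also occurs at the start
draft = propose_draft(context)                 # tokens that followed the earlier match
accepted = accept_draft(draft, [8, 9, 2, 5, 6])  # hypothetical target-model output
```

Structured tool output tends to repeat field names and values already present in the context, which is why a lookup-based drafter is a plausible fit for it.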

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper describes a stateful KV-cache system for multi-agent tool calling that reuses state across turns and reports speedups on synthetic workloads, but real-trace validation is still needed.

The core idea is a persistent KV cache that advances only on new tokens per turn, paired with a radix prefix cache for interleaved agents and speculative decoding for tool outputs. This turns repeated full-prompt processing into delta-only work, which matches the 85-95% reuse the paper reports for agent loops.
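The reuse figure can be made concrete with a small helper. This sketch assumes an append-only history, so the reused fraction at each turn is the previous prompt length divided by the current one; the function name and the sample lengths are illustrative and are not taken from the paper's workloads.

```python
def reuse_fractions(prompt_lengths: list[int]) -> list[float]:
    """Fraction of each turn's prompt that was already present on the prior turn."""
    fractions = []
    for prev, cur in zip(prompt_lengths, prompt_lengths[1:]):
        if cur < prev or cur == 0:
            raise ValueError("expected positive, non-decreasing prompt lengths")
        fractions.append(prev / cur)
    return fractions


# Illustrative lengths only, not measurements.
lengths = [900, 1000, 1250]
fracs = reuse_fractions(lengths)
```

In this toy case the second turn reuses 90% of its prompt, but a single 250-token tool response drops the third turn to 80%, below the 85-95% band the paper assumes.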

What stands out is the concrete integration for multi-agent traffic rather than single conversations. The reference implementation beats vLLM and SGLang by 2.1× per turn on a 6-turn workflow and 4.2× on the median turn of a 35-turn one, cutting end-to-end time in half. That is useful evidence for anyone running agentic systems at scale.

The soft spot is the evaluation. All numbers come from novel, fully-generated workloads built to exhibit high reuse. Real deployments often have variable tool-response lengths, branching contexts, and less predictable interleaving, any of which could lower prefix hit rates and shrink the reported gains. The paper notes the advantage comes from stateful reuse rather than simple caching, but without tests on organic traces the O(n) to O(Δ) conversion is not yet proven load-bearing.

The math is standard KV-cache reuse plus known radix and speculative techniques; nothing looks circular. Citations focus on the relevant serving baselines.

This is for engineers tuning inference for production multi-agent tool use. A reader who needs lower latency on repeated agent turns will find the architecture and numbers worth examining. It deserves peer review because the problem is timely and the implementation claims are specific enough to check.

Referee Report

2 major / 2 minor

Summary. The paper claims that conventional inference frameworks re-process full conversation histories for each tool call in multi-agent settings despite 85-95% prompt reuse across turns. It introduces a stateful architecture using a persistent KV cache that advances only on new tokens (O(Δ_t) cost), a radix prefix cache for interleaved multi-agent traffic, and a prompt-lookup speculative decoder for structured outputs. On novel fully-generated workloads, the implementation reports 2.1× per-turn speedup versus vLLM/SGLang on 6-turn workflows and 4.2× on the median turn of 35-turn workflows, halving end-to-end time, with gains attributed to stateful reuse and speculation rather than caching alone.

Significance. If the O(Δ_t) conversion and speedups hold under broader conditions, the work could meaningfully reduce latency in production multi-agent tool-calling systems by exploiting conversation statefulness. The combination of persistent KV, radix caching, and speculation is a concrete engineering contribution that distinguishes the approach from standard prefix caching.

major comments (2)

[Evaluation] Evaluation section (workloads description): experiments are restricted to 'novel, fully-generated workloads' engineered to exhibit the 85-95% unchanged-prompt property. No results are shown for organic multi-agent traces with variable tool-response lengths, branching contexts, or lower reuse rates; this leaves the practical O(Δ_t) advantage and the 'not caching' distinction unverified for realistic deployments.
[§3] §3 (Architecture): the claim that the radix prefix cache 'extends this across interleaved multi-agent traffic' is central to handling concurrent agents, yet no ablation isolates its contribution versus the persistent KV cache alone, nor quantifies hit rates under the reported workloads.

minor comments (2)

[Abstract] Abstract and §2: the notation O(n_t) and O(Δ_t) is used without an explicit definition of n_t or Δ_t in terms of token counts or cache operations; a short formalization would improve clarity.
[Figures] Figure captions (throughout): several figures comparing against vLLM and SGLang lack error bars or mention of the number of runs, making it hard to assess variability of the 2.1×/4.2× speedups.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the constructive comments. We respond to each major point below with clarifications on the evaluation design and architecture, indicating planned revisions where appropriate.

Referee: [Evaluation] Evaluation section (workloads description): experiments are restricted to 'novel, fully-generated workloads' engineered to exhibit the 85-95% unchanged-prompt property. No results are shown for organic multi-agent traces with variable tool-response lengths, branching contexts, or lower reuse rates; this leaves the practical O(Δ_t) advantage and the 'not caching' distinction unverified for realistic deployments.

Authors: The generated workloads were constructed specifically to isolate and measure the O(Δ_t) delta cost under the stated 85-95% reuse condition, allowing precise attribution of speedups to stateful reuse rather than other factors. This controlled setting demonstrates the architectural conversion from O(n) to O(Δ) when the reuse property holds, which is the core claim. We agree that organic traces with variable lengths, branching, and lower reuse would be needed to fully verify practical gains in arbitrary deployments. We will revise the evaluation section to discuss expected behavior under reduced reuse and explicitly note the limitation regarding organic data. revision: partial
Referee: [§3] §3 (Architecture): the claim that the radix prefix cache 'extends this across interleaved multi-agent traffic' is central to handling concurrent agents, yet no ablation isolates its contribution versus the persistent KV cache alone, nor quantifies hit rates under the reported workloads.

Authors: A persistent KV cache alone maintains state per conversation but does not efficiently share prefixes across interleaved agents with overlapping histories; the radix structure enables that sharing for concurrent multi-agent traffic. The reported workloads include such interleaving, so the speedups reflect the combined system. We will add text in §3 clarifying this necessity and the distinction from standard per-request prefix caching. However, the manuscript does not contain a separate ablation or hit-rate numbers, so we cannot add quantitative isolation without new experiments. revision: partial

standing simulated objections not resolved

Empirical results on organic multi-agent traces with variable tool-response lengths, branching contexts, or lower reuse rates
Quantitative ablation isolating the radix prefix cache contribution versus persistent KV cache alone, including hit rates

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an engineering architecture (persistent KV cache + radix prefix cache + prompt-lookup speculative decoder) that by design converts per-turn cost from O(n_t) to O(Δ_t). No mathematical derivation, fitted parameters, self-citations, or ansatzes are invoked; the central claims rest on direct empirical timing against external baselines (vLLM, SGLang) on the stated workloads. No step reduces to a self-referential definition or input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The 85-95% prompt overlap figure functions as an unverified domain assumption.

axioms (1)

domain assumption: 85-95% of the prompt is unchanged from the previous turn in multi-agent tool calling
Invoked in the opening paragraph to motivate the O(Δ_t) claim.

pith-pipeline@v0.9.1-grok · 5695 in / 1193 out tokens · 27650 ms · 2026-06-29T22:36:05.583392+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path
cs.LG 2026-06 unverdicted novelty 6.0

Speculative pre-positioning decodes stateful sessions ahead with the target model to enable near-constant-time responses from cached distributions or pre-paid deltas at 87% precision for capable models.

Reference graph

Works this paper leans on

8 extracted references · 2 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles, 2023.

[2] Zheng, L., Yin, L., Xie, Z., Huang, J., Sun, C., Yu, C.H., Cao, S., Kober, C., Sheng, Y., et al. SGLang: Efficient Execution of Structured Language Model Programs. arXiv preprint arXiv:2312.07104, 2024.

[3] Anthropic. Prompt Caching. Documentation, 2024.

[4] Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., and Zhang, H. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. OSDI, 2024.

[5] Yu, G.I., Jeong, J.S., Kim, G.W., Kim, S., and Chun, B.G. Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI, 2022.

[6] Leviathan, Y., Kalman, M., and Matias, Y. Fast Inference from Transformers via Speculative Decoding. ICML, 2023.

[7] Saxena, A. Prompt Lookup Decoding. https://github.com/apoorvumang/prompt-lookup-decoding, 2023.

[8] Norgren, V. Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers. arXiv preprint arXiv:2605.13784, 2026. https://arxiv.org/abs/2605.13784
