
arxiv: 2605.26289 · v1 · pith:L45PLJRMnew · submitted 2026-05-25 · 💻 cs.LG

Stateful Inference for Low-Latency Multi-Agent Tool Calling

Pith reviewed 2026-06-29 22:36 UTC · model grok-4.3

classification 💻 cs.LG
keywords stateful inference, KV cache reuse, multi-agent tool calling, LLM serving optimization, delta-only computation, speculative decoding, persistent prefix cache

The pith

A persistent KV cache across turns reduces multi-agent tool-calling cost from full-prompt reprocessing to only the new tokens each step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that conventional inference servers restart from the entire conversation history on every tool call even though 85 to 95 percent of the prompt is identical to the previous turn. It presents a stateful architecture that keeps the key-value cache alive, advances it only on the delta tokens, and adds a radix prefix cache plus speculative decoding to handle interleaved agents and structured outputs. If the approach holds, long agent workflows become far cheaper per additional turn because repeated computation on unchanged context is eliminated. The reference system is reported to be 2.1 times faster per turn on a 6-turn workflow and 4.2 times faster on the median turn of a 35-turn workflow, cutting end-to-end wall time in half.

Core claim

The central claim is that a stateful inference architecture converts the O(n_t) per-turn cost of conventional serving into an O(Δ_t) delta-only cost. A persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel fully-generated workloads the reference implementation is 2.1 times faster per turn on a 6-turn agentic workflow and 4.2 times faster on the median turn of a 35-turn one, with the advantage attributed to stateful reuse and speculation rather than to response-cache hits.
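The cost difference can be illustrated with a toy token-count model. This is a minimal sketch, not the paper's implementation: the names `StatefulSession`, `stateless_cost`, and `simulate` and the turn sizes are hypothetical, and "cost" counts only the prompt tokens that must be ingested, ignoring attention over the cached prefix and decoding.

```python
from dataclasses import dataclass, field


@dataclass
class StatefulSession:
    """Toy stand-in for a persistent KV cache: remembers tokens already ingested."""
    cached: list = field(default_factory=list)

    def advance(self, prompt):
        """Ingest `prompt` and return how many tokens had to be processed."""
        # Length of the longest common prefix between the cache and the prompt.
        keep = 0
        for a, b in zip(self.cached, prompt):
            if a != b:
                break
            keep += 1
        # Everything after the shared prefix is the delta that must be computed.
        delta = len(prompt) - keep
        self.cached = list(prompt)
        return delta


def stateless_cost(prompt):
    """A stateless server re-processes the full prompt on every turn."""
    return len(prompt)


def simulate(turn_deltas):
    """Grow one conversation by `turn_deltas` tokens per turn and record both costs."""
    session = StatefulSession()
    history, next_token = [], 0
    stateless, stateful = [], []
    for d in turn_deltas:
        history.extend(range(next_token, next_token + d))
        next_token += d
        stateless.append(stateless_cost(history))
        stateful.append(session.advance(history))
    return stateless, stateful


# Hypothetical workload: a 1000-token opening prompt, then 100 new tokens per turn.
stateless, stateful = simulate([1000, 100, 100, 100])
print(stateless)  # [1000, 1100, 1200, 1300]
print(stateful)   # [1000, 100, 100, 100]
```

In this toy model the stateless cost grows with the conversation length while the stateful cost tracks only the appended tokens, which is the O(n_t) versus O(Δ_t) distinction in miniature.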

What carries the argument

Persistent KV cache that lives across turns and advances by ingesting only the new tokens

If this is right

  • End-to-end wall time for multi-turn agentic workflows is halved.
  • The relative speedup increases with workflow length because the unchanged prefix is never recomputed.
  • Structured outputs incur lower overhead once the context is already cached.
  • Interleaved traffic from multiple agents can share prefix computations without duplication.
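The prefix-sharing point in the last bullet can be sketched with a token-level trie. This is an illustrative assumption, not the paper's data structure: a production radix cache would compress runs of tokens into single edges and store KV blocks at the nodes, whereas the hypothetical `PrefixCache` below only counts how many tokens would be recomputed.

```python
class PrefixCache:
    """Plain token trie standing in for a radix prefix cache (no edge compression)."""

    def __init__(self):
        self.root = {}

    def match(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        node, hit = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            hit += 1
        return hit

    def insert(self, tokens):
        """Record `tokens` so later requests can reuse any prefix of it."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def serve(self, tokens):
        """Return how many tokens must be computed, then cache the full prompt."""
        hit = self.match(tokens)
        self.insert(tokens)
        return len(tokens) - hit


# Hypothetical traffic: two agents share a 100-token system prompt and interleave.
system = list(range(100))
agent_a = system + [1001, 1002]
agent_b = system + [2001, 2002, 2003]

cache = PrefixCache()
computed = [
    cache.serve(agent_a),           # cold start: all 102 tokens
    cache.serve(agent_b),           # shares the system prompt: 3 tokens
    cache.serve(agent_a + [1003]),  # agent A's next turn: 1 token
    cache.serve(agent_b + [2004]),  # agent B's next turn: 1 token
]
print(computed)  # [102, 3, 1, 1]
```

Agent B never pays for the shared system prompt, and neither agent's state is evicted by the other's interleaved request, at least in this unbounded-memory toy.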

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same delta-only pattern could apply to any accumulating context session such as long chats or iterative planning loops.
  • Production agent deployments could support deeper or longer-running workflows once marginal cost per turn drops.
  • Workflow designers might deliberately increase turn count if they know recomputation is avoided.

Load-bearing premise

Existing inference frameworks always treat each tool call as an independent request and re-process the full conversation from scratch.

What would settle it

Running the reference implementation head-to-head with vLLM or SGLang on the 35-turn workload and observing that median per-turn latency is not at least 4 times lower.
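The check described above reduces to a ratio of medians. The sketch below shows the arithmetic only; the function names and the latency lists are placeholders for illustration, not measurements from the paper or from any real run.

```python
from statistics import median


def median_speedup(baseline_ms, candidate_ms):
    """Ratio of median per-turn latencies (baseline over candidate)."""
    return median(baseline_ms) / median(candidate_ms)


def claim_survives(baseline_ms, candidate_ms, threshold=4.0):
    """True if the candidate's median per-turn latency is at least `threshold` times lower."""
    return median_speedup(baseline_ms, candidate_ms) >= threshold


# Placeholder per-turn latencies in milliseconds (hypothetical values).
baseline = [400, 420, 440]
fast_candidate = [100, 105, 110]
slow_candidate = [100, 120, 140]

print(claim_survives(baseline, fast_candidate))  # True  (420 / 105 = 4.0)
print(claim_survives(baseline, slow_candidate))  # False (420 / 120 = 3.5)
```

A real replication would also need identical hardware, model, and sampling settings across systems, plus enough repeated runs to bound the variance of the medians.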

Figures

Figures reproduced from arXiv: 2605.26289 by Victor Norgren.

Figure 1: Median per-turn latency on novel, fully-generated tool-call workloads (no response-cache …
Figure 2: LayerScale speedup over vLLM and SGLang on median per-turn latency, with no caching …
Figure 3: Cumulative wall time over the 35-turn novel coding workflow (no response-cache hits).
Original abstract

Multi-agent tool calling is becoming the dominant interaction pattern for LLM-based systems, yet existing inference frameworks treat each tool call as an independent request, re-processing the entire conversation from scratch even though 85-95% of the prompt is unchanged from the previous turn. We present a stateful inference architecture that converts the $O(n_t)$ per-turn cost of conventional serving into an $O(\Delta_t)$ delta-only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel, fully-generated workloads, the reference implementation is $2.1\times$ faster per turn on a 6-turn agentic workflow and $4.2\times$ on the median turn of a 35-turn one, halving end-to-end wall time. The advantage comes from stateful reuse and speculation, not caching.
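The prompt-lookup idea mentioned in the abstract can be sketched as follows, in the spirit of the Prompt Lookup Decoding reference. This is a simplified illustration under stated assumptions: `propose_draft` and `accept_prefix` are hypothetical helper names, tokens are plain strings, and the "verified" tokens are supplied by hand where a real system would obtain them from one forward pass of the target model.

```python
def propose_draft(tokens, ngram_size=3, num_draft=5):
    """Propose draft tokens by copying what followed the most recent earlier
    occurrence of the trailing n-gram in the context."""
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan right to left over start positions that precede the trailing n-gram.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow_from = start + ngram_size
            return tokens[follow_from:follow_from + num_draft]
    return []


def accept_prefix(draft, verified):
    """Keep the longest prefix of `draft` that agrees with the target model's tokens."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return draft[:n]


# Hypothetical context: an earlier JSON tool call, then the start of a new one.
context = ["{", '"name"', ":", '"search"', ",", '"args"', ":", "{", "}", "}",
           "{", '"name"', ":"]

draft = propose_draft(context)
print(draft)  # ['"search"', ',', '"args"', ':', '{']

# Suppose the target model would actually emit a different key after the comma.
print(accept_prefix(draft, ['"search"', ",", '"query"']))  # ['"search"', ',']
```

Structured tool-call output repeats keys and punctuation from earlier turns, which is why copying from the context tends to produce drafts that the target model accepts.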

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that conventional inference frameworks re-process full conversation histories for each tool call in multi-agent settings despite 85-95% prompt reuse across turns. It introduces a stateful architecture using a persistent KV cache that advances only on new tokens (O(Δ_t) cost), a radix prefix cache for interleaved multi-agent traffic, and a prompt-lookup speculative decoder for structured outputs. On novel fully-generated workloads, the implementation reports 2.1× per-turn speedup versus vLLM/SGLang on 6-turn workflows and 4.2× on the median turn of 35-turn workflows, halving end-to-end time, with gains attributed to stateful reuse and speculation rather than caching alone.

Significance. If the O(Δ_t) conversion and speedups hold under broader conditions, the work could meaningfully reduce latency in production multi-agent tool-calling systems by exploiting conversation statefulness. The combination of persistent KV, radix caching, and speculation is a concrete engineering contribution that distinguishes the approach from standard prefix caching.

major comments (2)
  1. [Evaluation] Evaluation section (workloads description): experiments are restricted to 'novel, fully-generated workloads' engineered to exhibit the 85-95% unchanged-prompt property. No results are shown for organic multi-agent traces with variable tool-response lengths, branching contexts, or lower reuse rates; this leaves the practical O(Δ_t) advantage and the 'not caching' distinction unverified for realistic deployments.
  2. [§3] §3 (Architecture): the claim that the radix prefix cache 'extends this across interleaved multi-agent traffic' is central to handling concurrent agents, yet no ablation isolates its contribution versus the persistent KV cache alone, nor quantifies hit rates under the reported workloads.
minor comments (2)
  1. [Abstract] Abstract and §2: the notation O(n_t) and O(Δ_t) is used without an explicit definition of n_t or Δ_t in terms of token counts or cache operations; a short formalization would improve clarity.
  2. [Figures] Figure captions (throughout): several figures comparing against vLLM and SGLang lack error bars or mention of the number of runs, making it hard to assess variability of the 2.1×/4.2× speedups.
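One possible formalization of the notation raised in minor comment 1 is sketched below. The definitions are assumptions made here for illustration, not taken from the paper: $n_t$ is read as the total prompt length in tokens at turn $t$, $\Delta_t$ as the number of tokens appended since turn $t-1$, and cost is counted in prompt tokens ingested, ignoring attention over the cached prefix and decoding.

```latex
% Assumed definitions: n_t = prompt length at turn t, \Delta_t = tokens added at turn t.
\[
  n_t = n_{t-1} + \Delta_t, \qquad n_0 = 0 .
\]
% Prompt tokens ingested over a T-turn workflow under each serving model.
\[
  C_{\text{stateless}}(T) = \sum_{t=1}^{T} n_t ,
  \qquad
  C_{\text{stateful}}(T) = \sum_{t=1}^{T} \Delta_t = n_T .
\]
% Reuse fraction r_t; the 85-95% figure corresponds to r_t in [0.85, 0.95].
\[
  r_t = \frac{n_{t-1}}{n_t}
  \quad\Longrightarrow\quad
  \frac{\Delta_t}{n_t} = 1 - r_t \in [0.05,\, 0.15] .
\]
```

Under these assumptions the stateful scheme ingests between 5% and 15% of the tokens a stateless server would ingest on a given turn, which bounds the prefill saving but says nothing about decode time.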

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the constructive comments. We respond to each major point below with clarifications on the evaluation design and architecture, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section (workloads description): experiments are restricted to 'novel, fully-generated workloads' engineered to exhibit the 85-95% unchanged-prompt property. No results are shown for organic multi-agent traces with variable tool-response lengths, branching contexts, or lower reuse rates; this leaves the practical O(Δ_t) advantage and the 'not caching' distinction unverified for realistic deployments.

    Authors: The generated workloads were constructed specifically to isolate and measure the O(Δ_t) delta cost under the stated 85-95% reuse condition, allowing precise attribution of speedups to stateful reuse rather than other factors. This controlled setting demonstrates the architectural conversion from O(n) to O(Δ) when the reuse property holds, which is the core claim. We agree that organic traces with variable lengths, branching, and lower reuse would be needed to fully verify practical gains in arbitrary deployments. We will revise the evaluation section to discuss expected behavior under reduced reuse and explicitly note the limitation regarding organic data. revision: partial

  2. Referee: [§3] §3 (Architecture): the claim that the radix prefix cache 'extends this across interleaved multi-agent traffic' is central to handling concurrent agents, yet no ablation isolates its contribution versus the persistent KV cache alone, nor quantifies hit rates under the reported workloads.

    Authors: A persistent KV cache alone maintains state per conversation but does not efficiently share prefixes across interleaved agents with overlapping histories; the radix structure enables that sharing for concurrent multi-agent traffic. The reported workloads include such interleaving, so the speedups reflect the combined system. We will add text in §3 clarifying this necessity and the distinction from standard per-request prefix caching. However, the manuscript does not contain a separate ablation or hit-rate numbers, so we cannot add quantitative isolation without new experiments. revision: partial

standing simulated objections not resolved
  • Empirical results on organic multi-agent traces with variable tool-response lengths, branching contexts, or lower reuse rates
  • Quantitative ablation isolating the radix prefix cache contribution versus persistent KV cache alone, including hit rates

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper describes an engineering architecture (persistent KV cache + radix prefix cache + prompt-lookup speculative decoder) that by design converts per-turn cost from O(n_t) to O(Δ_t). No mathematical derivation, fitted parameters, self-citations, or ansatzes are invoked; the central claims rest on direct empirical timing against external baselines (vLLM, SGLang) on the stated workloads. No step reduces to a self-referential definition or input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The 85-95% prompt overlap figure functions as an unverified domain assumption.

axioms (1)
  • domain assumption: 85-95% of the prompt is unchanged from the previous turn in multi-agent tool calling
    Invoked in the opening paragraph to motivate the O(Δ_t) claim.

pith-pipeline@v0.9.1-grok · 5695 in / 1193 out tokens · 27650 ms · 2026-06-29T22:36:05.583392+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path

    cs.LG · 2026-06 · unverdicted · novelty 6.0

    Speculative pre-positioning decodes stateful sessions ahead with the target model to enable near-constant-time responses from cached distributions or pre-paid deltas at 87% precision for capable models.

Reference graph

Works this paper leans on

8 extracted references · 2 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles, 2023
  2. Zheng, L., Yin, L., Xie, Z., Huang, J., Sun, C., Yu, C.H., Cao, S., Kober, C., Sheng, Y., et al. SGLang: Efficient Execution of Structured Language Model Programs. arXiv preprint arXiv:2312.07104, 2024
  3. Anthropic. Prompt Caching. Documentation, 2024
  4. Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., and Zhang, H. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. OSDI, 2024
  5. Yu, G.I., Jeong, J.S., Kim, G.W., Kim, S., and Chun, B.G. Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI, 2022
  6. Leviathan, Y., Kalman, M., and Matias, Y. Fast Inference from Transformers via Speculative Decoding. ICML, 2023
  7. Saxena, A. Prompt Lookup Decoding. https://github.com/apoorvumang/prompt-lookup-decoding, 2023
  8. Norgren, V. Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers. arXiv preprint arXiv:2605.13784, 2026. https://arxiv.org/abs/2605.13784