Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

2026 · cs.LG · arXiv 2605.13784

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Conventional transformer inference engines are request-driven, paying an O(n) prefill cost on every query. In streaming workloads, where data arrives continuously and queries probe an ever-growing context, this cost is prohibitive. We introduce a data-driven computational model centred on stateful sessions: a persistent KV cache advanced incrementally as new data arrives, so prefill is moved off the critical path and query latency becomes O(|q|), independent of accumulated context size. Building on this, Flash Queries reclaim idle GPU cycles between data arrivals to pre-evaluate registered questions and return cached answers before the user asks, a pattern that is structurally impossible in stateless engines because they discard intermediate state between requests. A multi-tenant continuous-batching scheduler with cell-budget admission and prefix-aware grouped prefill lets dozens of stateful sessions coexist on a single GPU while preserving full quadratic self-attention. On streaming market-data benchmarks the reference implementation achieves up to 5.9x speedup over conventional inference engines (vLLM, SGLang, TensorRT-LLM, llama.cpp), holding query latency constant as accumulated context grows.
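
The cost accounting described in the abstract can be illustrated with a toy model. The sketch below is not the paper's implementation: the class names, the `spread_alert` question, and the token counts are all hypothetical, and "cost" here counts only the new tokens pushed through the model on the critical path. It does not model attention over the cache, scheduling, or GPU memory.

```python
from dataclasses import dataclass, field


@dataclass
class StatelessEngine:
    """Request-driven baseline: the full context is re-prefilled per query."""
    context_len: int = 0

    def ingest(self, n_tokens: int) -> None:
        self.context_len += n_tokens  # stored only; no model work yet

    def query_cost(self, q_len: int) -> int:
        # Tokens processed on the critical path: whole context plus query.
        return self.context_len + q_len


@dataclass
class StatefulSession:
    """Toy stateful session: the cache grows at ingest time, not query time."""
    cached_len: int = 0                               # stand-in for KV cache length
    registered: dict = field(default_factory=dict)    # name -> query length
    precomputed: dict = field(default_factory=dict)   # name -> cached_len when evaluated

    def register(self, name: str, q_len: int) -> None:
        self.registered[name] = q_len

    def ingest(self, n_tokens: int) -> None:
        self.cached_len += n_tokens                   # prefill off the critical path
        for name in self.registered:                  # idle-time pre-evaluation (placeholder)
            self.precomputed[name] = self.cached_len

    def query_cost(self, q_len: int) -> int:
        return q_len                                  # only the query tokens are new

    def flash_cost(self, name: str) -> int:
        # A registered question whose answer is fresh costs no new tokens.
        fresh = self.precomputed.get(name) == self.cached_len
        return 0 if fresh else self.registered[name]


stateless, session = StatelessEngine(), StatefulSession()
session.register("spread_alert", q_len=12)
for _ in range(3):                                    # three arrivals of 1000 tokens each
    stateless.ingest(1000)
    session.ingest(1000)

print(stateless.query_cost(12))             # 3012
print(session.query_cost(12))               # 12
print(session.flash_cost("spread_alert"))   # 0
```

In this toy accounting the stateless cost grows with every data arrival, while the stateful query cost stays at the query length and a pre-evaluated registered question costs nothing at ask time.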

representative citing papers

Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

Speculative pre-positioning decodes stateful sessions ahead with the target model to enable near-constant-time responses from cached distributions or pre-paid deltas at 87% precision for capable models.

Stateful Inference for Low-Latency Multi-Agent Tool Calling

cs.LG · 2026-05-25 · unverdicted · novelty 5.0

Stateful KV cache with radix prefix cache and prompt-lookup speculative decoder reduces per-turn cost from O(n) to O(Δ) and delivers 2.1-4.2× speedups versus vLLM and SGLang on generated multi-agent workloads.

citing papers explorer

Showing 2 of 2 citing papers.

Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path

cs.LG · 2026-06-28 · unverdicted · none · ref 11 · internal anchor

Stateful Inference for Low-Latency Multi-Agent Tool Calling

cs.LG · 2026-05-25 · unverdicted · none · ref 8 · internal anchor
