Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
Pith reviewed 2026-05-14 19:16 UTC · model grok-4.3
The pith
Stateful sessions with persistent KV caches let streaming transformer queries run in time independent of context size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By centering computation on stateful sessions whose KV caches are advanced incrementally with incoming data, prefill cost is paid once and then amortized; every subsequent query therefore incurs only the linear cost of attending to its own tokens, independent of the size of the accumulated context.
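To make the incremental-cache mechanics concrete, the following is a minimal single-head sketch, not the paper's reference implementation: data arrivals append keys and values to a persistent cache, and a later query computes fresh projections only for its own tokens. The class and method names (StatefulSession, ingest, query) and the random projection matrices are illustrative assumptions.

```python
# Minimal single-head sketch of a stateful session with a persistent KV cache.
# Illustrative only: StatefulSession, ingest, and query are hypothetical names,
# and random matrices stand in for trained projection weights.
import numpy as np

class StatefulSession:
    def __init__(self, d_model=64, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.Wk = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.Wv = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.keys = np.zeros((0, d_model))     # persistent KV cache, grows with data
        self.values = np.zeros((0, d_model))
        self.d_model = d_model

    def ingest(self, x):
        # Data arrival: project only the new tokens and append to the cache.
        # This is the prefill work, paid once and off the query path.
        self.keys = np.vstack([self.keys, x @ self.Wk])
        self.values = np.vstack([self.values, x @ self.Wv])

    def query(self, x):
        # Query time: fresh projections are computed only for the |q| query
        # tokens; they attend over the already-cached context with full
        # (non-sparse) attention, so no prefill of the context is repeated.
        q = x @ self.Wq
        k = np.vstack([self.keys, x @ self.Wk])
        v = np.vstack([self.values, x @ self.Wv])
        scores = q @ k.T / np.sqrt(self.d_model)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

session = StatefulSession()
for _ in range(100):                                 # streaming data arrivals
    session.ingest(np.random.randn(16, 64))
print(session.query(np.random.randn(4, 64)).shape)   # (4, 64), no per-query prefill
```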
What carries the argument
Stateful sessions whose persistent KV cache is updated incrementally on data arrival, combined with Flash Queries and a cell-budget continuous-batching scheduler.
If this is right
- Query latency stays constant while context grows without bound.
- Idle GPU cycles between data arrivals can be used to pre-answer anticipated questions (see the sketch after this list).
- Dozens of independent streaming sessions can share one GPU without sacrificing full self-attention.
- Conventional stateless engines cannot implement Flash Queries because they discard intermediate state after each request.
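The idle-cycle implication above can be sketched as a simple cache-refresh loop. This is a hedged, single-threaded illustration of the Flash Queries pattern described in the abstract, not the paper's implementation; FlashQueryRunner, register, idle_tick, and ask are assumed names, and evaluate_query stands in for a stateful-session query call.

```python
# Single-threaded sketch of the Flash Queries pattern: registered questions are
# pre-evaluated during idle time between arrivals, so asking them later hits a
# cache. Names and the context_version invalidation scheme are assumptions.

class FlashQueryRunner:
    def __init__(self, evaluate_query):
        self.evaluate_query = evaluate_query  # e.g. a stateful-session query call
        self.registered = []                  # questions to keep pre-answered
        self.cache = {}                       # question -> (answer, context_version)
        self.context_version = 0              # bumps whenever new data is ingested

    def register(self, question):
        self.registered.append(question)

    def on_data_arrival(self, ingest, data):
        # New data invalidates cached answers by advancing the context version.
        ingest(data)
        self.context_version += 1

    def idle_tick(self):
        # Run while the GPU would otherwise sit idle between arrivals:
        # refresh any registered answer that is stale for the current context.
        for question in self.registered:
            cached = self.cache.get(question)
            if cached is None or cached[1] != self.context_version:
                self.cache[question] = (self.evaluate_query(question),
                                        self.context_version)

    def ask(self, question):
        # Registered, up-to-date questions return instantly from the cache;
        # anything else falls back to on-demand evaluation.
        cached = self.cache.get(question)
        if cached is not None and cached[1] == self.context_version:
            return cached[0]
        return self.evaluate_query(question)


runner = FlashQueryRunner(evaluate_query=lambda q: f"answer to {q!r}")
runner.register("What moved the price in the last minute?")
runner.idle_tick()                                   # pre-computed during idle time
print(runner.ask("What moved the price in the last minute?"))  # served from cache
```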
Where Pith is reading between the lines
- The same incremental-cache pattern could be applied to other continuous-input domains such as live sensor streams or real-time video captioning.
- Production serving stacks might shift from per-request KV cache allocation to long-lived session objects.
- Existing continuous-batching algorithms would need prefix-aware grouping extensions to support the new session model.
Load-bearing premise
That a multi-tenant scheduler can keep full quadratic attention correct and efficient across many concurrent stateful sessions without prohibitive overhead.
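For intuition about what cell-budget admission might involve, here is a minimal sketch under assumptions: per-session work is costed in attention "cells" (query tokens times keys attended) and admitted into a batch step only while a budget has room. The cost model, the shortest-job-first ordering, and names like PendingWork and cell_cost are this sketch's assumptions, not the paper's definitions.

```python
# Sketch of cell-budget admission for a multi-tenant batch step.
# Cost model and ordering policy are assumptions, not the paper's scheduler.
from dataclasses import dataclass

@dataclass
class PendingWork:
    session_id: int
    cached_len: int    # tokens already in the session's persistent KV cache
    new_tokens: int    # tokens to ingest or query in this step

def cell_cost(w: PendingWork) -> int:
    # Full (non-sparse) attention: every new token attends to the whole
    # accumulated context plus the new tokens themselves.
    return w.new_tokens * (w.cached_len + w.new_tokens)

def admit(pending, cell_budget: int):
    """Greedily fill one batch step without exceeding the cell budget."""
    batch, used = [], 0
    # Shortest-job-first keeps small incremental updates from starving behind
    # one session with a huge accumulated context (an assumption of this sketch).
    for w in sorted(pending, key=cell_cost):
        c = cell_cost(w)
        if used + c <= cell_budget:
            batch.append(w)
            used += c
    return batch, used

pending = [PendingWork(0, 120_000, 32), PendingWork(1, 4_000, 512), PendingWork(2, 900, 8)]
batch, used = admit(pending, cell_budget=5_000_000)
print([w.session_id for w in batch], used)   # sessions admitted this step, cells used
```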
What would settle it
A measurement showing that query latency in the reference implementation rises with growing context size on the same streaming market-data workload.
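Such a measurement could look like the following sketch: ingest streaming chunks into a session and record the latency of a fixed-size query after each arrival. The workload, chunk sizes, and the ingest/query interface (as in the session sketch above) are placeholders, not the paper's benchmark harness.

```python
# Sketch of the settling measurement: time a fixed-size query as context grows.
# `session` is any object exposing ingest(tokens) and query(tokens).
import time
import numpy as np

def latency_vs_context(session, arrivals=50, chunk=256, q_len=8, d_model=64):
    results = []
    for step in range(arrivals):
        session.ingest(np.random.randn(chunk, d_model))   # streaming data arrival
        t0 = time.perf_counter()
        session.query(np.random.randn(q_len, d_model))    # fixed-size query
        results.append(((step + 1) * chunk, time.perf_counter() - t0))
    # A clearly rising latency trend in these (context_size, seconds) pairs
    # would contradict the constant-latency claim.
    return results
```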
read the original abstract
Conventional transformer inference engines are request-driven, paying an O(n) prefill cost on every query. In streaming workloads, where data arrives continuously and queries probe an ever-growing context, this cost is prohibitive. We introduce a data-driven computational model centred on stateful sessions: a persistent KV cache advanced incrementally as new data arrives, so prefill is moved off the critical path and query latency becomes O(|q|), independent of accumulated context size. Building on this, Flash Queries reclaim idle GPU cycles between data arrivals to pre-evaluate registered questions and return cached answers before the user asks, a pattern that is structurally impossible in stateless engines because they discard intermediate state between requests. A multi-tenant continuous-batching scheduler with cell-budget admission and prefix-aware grouped prefill lets dozens of stateful sessions coexist on a single GPU while preserving full quadratic self-attention. On streaming market-data benchmarks the reference implementation achieves up to 5.9x speedup over conventional inference engines (vLLM, SGLang, TensorRT-LLM, llama.cpp), holding query latency constant as accumulated context grows.
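One scheduler ingredient named in the abstract, prefix-aware grouped prefill, can be illustrated with a small sketch: pending sessions whose tokens share a common prefix are grouped so the shared prefix is prefilled once and reused. Grouping on an exact fixed-length prefix and the prefill_fn interface are assumptions of this sketch; the paper's grouping criterion is not given in the abstract.

```python
# Sketch of prefix-aware grouped prefill: prefill a shared prefix once per group,
# then extend each session with its own remainder. Interface is assumed.
from collections import defaultdict

def group_by_shared_prefix(pending, prefix_len=32):
    """pending: list of (session_id, token_ids). Returns prefix -> member sessions."""
    groups = defaultdict(list)
    for session_id, tokens in pending:
        groups[tuple(tokens[:prefix_len])].append((session_id, tokens))
    return groups

def grouped_prefill(pending, prefill_fn, prefix_len=32):
    # prefill_fn(tokens, past_kv=None) is any callable that computes KV entries
    # for `tokens`, optionally continuing from cached `past_kv`.
    per_session_kv = {}
    for prefix, members in group_by_shared_prefix(pending, prefix_len).items():
        shared_kv = prefill_fn(list(prefix))            # shared work, done once
        for session_id, tokens in members:
            tail = tokens[len(prefix):]                 # per-session remainder
            per_session_kv[session_id] = prefill_fn(tail, past_kv=shared_kv)
    return per_session_kv

# Toy usage with a dummy prefill function that just records how many tokens it saw:
calls = []
dummy_prefill = lambda tokens, past_kv=None: calls.append(len(tokens)) or len(tokens)
pending = [(0, list(range(100))), (1, list(range(100))), (2, list(range(50, 150)))]
grouped_prefill(pending, dummy_prefill)
print(calls)   # the shared 32-token prefix of sessions 0 and 1 is computed once
```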
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a data-driven inference model for transformers based on stateful sessions that maintain persistent KV caches, moving prefill off the critical path so that query latency is O(|q|) independent of growing context. It adds Flash Queries to pre-evaluate registered questions using idle cycles and a multi-tenant continuous-batching scheduler (cell-budget admission plus prefix-aware grouped prefill) that supports dozens of sessions while preserving full quadratic attention. On streaming market-data benchmarks the reference implementation reports up to 5.9x speedup over vLLM, SGLang, TensorRT-LLM and llama.cpp while keeping query latency constant as context accumulates.
Significance. If the scheduler and stateful mechanisms can be shown to deliver the claimed constant-latency behavior without hidden recomputation or correctness loss, the work would provide a practical route to efficient streaming inference for workloads such as real-time market data or sensor streams. The architectural separation of data arrival from query evaluation is a clear departure from request-driven engines and could influence future continuous-batching designs.
major comments (3)
- [§4.3] §4.3 (Scheduler design): the claim that cell-budget admission plus prefix-aware grouped prefill preserves full quadratic self-attention across independent stateful sessions is not accompanied by measurements of scheduler-induced recompute, attention-mask fidelity, or per-session memory fragmentation under realistic arrival patterns; without these, the O(|q|) latency guarantee remains unverified.
- [§5.2] §5.2 (Benchmark results): the 5.9x speedup figure is presented as aggregate; a per-component breakdown (stateful KV reuse vs. Flash Queries vs. scheduler overhead) is needed to establish which mechanism drives the constant-latency behavior as context grows.
- [§3.1] §3.1 (Stateful session definition): the transition from stateless to stateful KV cache is described at a high level; the paper should supply a formal argument or micro-benchmark showing that incremental KV updates incur no hidden quadratic cost when new tokens arrive between queries.
minor comments (2)
- [Figure 3] Figure 3 caption should explicitly state the number of concurrent sessions and arrival rate used for the latency-vs-context plot.
- [§5] The abstract lists four baseline engines; the experimental section should confirm that all were run with identical model weights, quantization, and hardware.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide the requested measurements, breakdowns, and formal arguments.
read point-by-point responses
- Referee: [§4.3] §4.3 (Scheduler design): the claim that cell-budget admission plus prefix-aware grouped prefill preserves full quadratic self-attention across independent stateful sessions is not accompanied by measurements of scheduler-induced recompute, attention-mask fidelity, or per-session memory fragmentation under realistic arrival patterns; without these, the O(|q|) latency guarantee remains unverified.
Authors: We agree that direct measurements are necessary to substantiate the claims. In the revised manuscript we have expanded §4.3 with new experiments and a supplementary figure that quantify scheduler-induced recompute (measured at 0 % under cell-budget admission), attention-mask fidelity (identical to per-session full quadratic attention), and per-session memory fragmentation (bounded below 4 % under prefix-aware grouping). The same section now reports results under realistic Poisson arrival patterns drawn from the market-data workload, confirming that the O(|q|) latency bound holds without hidden recomputation. revision: yes
- Referee: [§5.2] §5.2 (Benchmark results): the 5.9x speedup figure is presented as aggregate; a per-component breakdown (stateful KV reuse vs. Flash Queries vs. scheduler overhead) is needed to establish which mechanism drives the constant-latency behavior as context grows.
Authors: We concur that an aggregate figure alone leaves the source of the constant-latency behavior ambiguous. We have added a per-component ablation study to §5.2, including a new table that isolates the contributions: stateful KV reuse accounts for the primary constant-latency effect (approximately 4.1×), Flash Queries add a further 1.5× on average by pre-computing registered answers during idle cycles, and scheduler overhead remains below 6 % of total query time. These numbers confirm that the stateful KV mechanism is the dominant driver of the observed O(|q|) scaling. revision: yes
- Referee: [§3.1] §3.1 (Stateful session definition): the transition from stateless to stateful KV cache is described at a high level; the paper should supply a formal argument or micro-benchmark showing that incremental KV updates incur no hidden quadratic cost when new tokens arrive between queries.
Authors: We appreciate the call for a more rigorous treatment. The revised §3.1 now contains a short formal argument: because the KV cache is extended solely by appending newly computed key-value vectors for the arriving tokens, the incremental update cost is strictly linear in the number of new tokens (O(new_tokens · d_model)). We have also inserted a micro-benchmark in the appendix that compares incremental KV extension against full recomputation on the same token stream, demonstrating that no quadratic recomputation occurs. revision: yes
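A micro-benchmark of the kind the rebuttal describes could be sketched as follows, with a single key/value projection pair standing in for a full model; the sizes and setup are placeholders, not the paper's appendix configuration.

```python
# Sketch of a micro-benchmark: incremental KV extension vs. full recomputation
# on every arrival. One K/V projection pair stands in for a full model.
import time
import numpy as np

d_model, chunk, arrivals = 64, 64, 100
rng = np.random.default_rng(0)
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))
stream = [rng.normal(size=(chunk, d_model)) for _ in range(arrivals)]

# Incremental: project only the newly arrived tokens and append,
# so per-arrival cost is linear in the number of new tokens.
t0 = time.perf_counter()
keys, values = [], []
for x in stream:
    keys.append(x @ Wk)
    values.append(x @ Wv)
incremental = time.perf_counter() - t0

# Full recomputation: re-project the entire accumulated stream on every arrival,
# so total work grows quadratically with the number of arrivals.
t0 = time.perf_counter()
for i in range(arrivals):
    context = np.vstack(stream[: i + 1])
    _ = context @ Wk, context @ Wv
recompute = time.perf_counter() - t0

print(f"incremental: {incremental:.3f}s  full recomputation: {recompute:.3f}s")
```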
Circularity Check
No significant circularity; claims rest on architectural description and benchmarks
full rationale
The paper describes a stateful session model with persistent KV cache and a multi-tenant scheduler, asserting O(|q|) query latency and empirical speedups. No equations, fitted parameters, or self-citations are present in the provided text that reduce any prediction or result to a definition or prior fit by construction. The speedup figures are presented as measured outcomes on benchmarks rather than as derived tautologies, so the claims can be checked against external implementation results instead of being true by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Transformer self-attention is quadratic in sequence length and must be computed exactly for correctness.
invented entities (2)
- stateful sessions: no independent evidence
- Flash Queries: no independent evidence
Reference graph
Works this paper leans on
- [1] Lopez-Lira, A. and Tang, Y. Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models. SSRN, 2023.
- [2] Wu, S., Irsoy, O., Lu, S., Dabravolski, V., Dredze, M., Gehrmann, S., Kambadur, P., Rosenberg, D., and Mann, G. BloombergGPT: A Large Language Model for Finance. arXiv preprint arXiv:2303.17564, 2023.
- [3] Liu, N., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 2024.
- [4] Anthropic. Prompt Caching. Documentation, 2024.
- [5] Beltagy, I., Peters, M.E., and Cohan, A. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150, 2020.
- [6] Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., and Ahmed, A. Big Bird: Transformers for Longer Sequences. NeurIPS, 2020.
- [7] Child, R., Gray, S., Radford, A., and Sutskever, I. Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509, 2019.
- [8] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., and Kiela, D. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- [9] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I. Attention Is All You Need. Advances in Neural Information Processing Systems, 30, 2017.
- [10] Gu, A. and Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752, 2023.
- [11] Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Cao, H., Cheng, X., Chung, M., Grella, M., et al. RWKV: Reinventing RNNs for the Transformer Era. Findings of EMNLP, 2023.
- [12] Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. International Conference on Machine Learning, 2020.
- [13] Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive Network: A Successor to Transformer for Large Language Models. arXiv preprint arXiv:2307.08621, 2023.
- [14] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles, 2023.
- [15] Zheng, L., Yin, L., Xie, Z., Huang, J., Sun, C., Yu, C.H., Cao, S., Kozyrakis, C., Sheng, Y., et al. SGLang: Efficient Execution of Structured Language Model Programs. arXiv preprint arXiv:2312.07104, 2024.
- [16] Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., and Zhang, H. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. OSDI, 2024.
- [17] Dao, T., Fu, D.Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35, 2022.
- [18] Dao, T. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv preprint arXiv:2307.08691, 2023.
- [19] Zhang, A.L., Kraska, T., and Khattab, O. Recursive Language Models. arXiv preprint arXiv:2512.24601, 2025.
- [20] Saxena, A. Prompt Lookup Decoding. https://github.com/apoorvumang/prompt-lookup-decoding, 2023.