Runtime-Certified Bounded-Error Quantized Attention

Dean Calver

arxiv: 2605.20868 · v1 · pith:2AVEJ224new · submitted 2026-05-20 · 💻 cs.LG · cs.AI· cs.SY· eess.SY

Runtime-Certified Bounded-Error Quantized Attention

Dean Calver This is my paper

Pith reviewed 2026-05-21 05:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.SYeess.SY

keywords KV cache quantizationruntime certificationattention error boundsLLM inferenceadaptive precisionfallback mechanismslong context modelsquantized attention

0 comments

The pith

A two-term error decomposition yields runtime-computable bounds that certify per-head quantized attention or trigger exact fallback recovery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that quantizing the KV cache for long-context LLM inference can be made reliable by splitting the total approximation error into two terms: one capturing how key quantization distorts the attention distribution, and the other capturing value reconstruction error. These terms are bounded separately for each head and each step, and the bounds are evaluated during inference itself. The resulting numbers control an adaptive choice of precision and a ladder of fallbacks that ultimately restores the exact dense FP16 attention output whenever any bound is threatened. A sympathetic reader would care because this turns memory-saving quantization from an empirical risk into a computation that can be verified and corrected on the fly, without waiting for post-hoc checks.

Core claim

The authors establish that a two-term error decomposition yields per-head, per-step bounds on attention distribution distortion from key quantization and value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. The certification remains local to each attention computation and does not claim end-to-end model correctness, but ensures every step is either provably close to an FP16 reference or exactly recovered via fallback.

What carries the argument

The two-term error decomposition that isolates attention distribution distortion caused by key quantization from value reconstruction error, together with the online bounds and multi-stage fallback ladder that enforce either a certified bound or exact recovery.

If this is right

The system matches dense FP16 KV quality within noise on language modelling and retrieval tasks across PG-19, NIAH and RULER benchmarks with contexts up to 128K.
Catastrophic failures observed in naive INT8/INT4 baselines are recovered through the fallback ladder.
Value-sensitive tasks expose a controlled trade-off that can be removed by tightening value tolerances or using FP16-value fallback.
The certification applies locally per head and per step rather than to the full model output.
KV cache quantization is reframed as runtime-verified computation instead of a fixed approximation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition idea could be tested on other approximate operations inside the transformer block, such as activation or weight quantization, if analogous error terms can be isolated.
Production systems with strict quality SLAs might adopt the fallback ladder to guarantee that no single attention step silently degrades output.
Extending the bounds to account for interactions across multiple heads or layers would be a natural next measurement to check whether the current per-head locality is sufficient.

Load-bearing premise

The two-term error decomposition accurately separates and bounds the quantization effects in a way that remains valid and computable at runtime without missing significant interactions.

What would settle it

A specific attention step in which the measured output error exceeds the computed bound yet the fallback mechanism fails to restore the exact FP16 result.

Figures

Figures reproduced from arXiv: 2605.20868 by Dean Calver.

**Figure 2.** Figure 2: Fused certified attention with mask-gated key precision. Phase 1 is a lightweight scoring [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Runtime VRAM vs. dense FP16, normalised (measured Tier-1 + scratch). Below 64K the fixed [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗

read the original abstract

KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in system RAM for deterministic fallback. A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. Across PG-19, NIAH, and RULER benchmarks on LLaMA~3.1-8B with contexts up to 128K, the system matches dense FP16 KV quality within noise for language modelling and retrieval tasks, while recovering catastrophic failures observed in naive INT8/INT4 baselines. Value-sensitive tasks at short context expose a controlled trade-off between compression and fidelity, which can be eliminated via tighter value tolerances or FP16-value fallback. The certification is local (per-head, per-step) and does not guarantee end-to-end model correctness, but ensures that each attention computation is either bounded relative to an FP16 reference or exactly recovered via fallback. This reframes KV cache quantization as a runtime-verified computation rather than a fixed approximation. The goal is not raw speedups, but enabling safe deployment of aggressive KV compression under strict quality constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a runtime fallback mechanism with online per-head error bounds to quantized KV attention, but the two-term decomposition leaves the cross-term error unaddressed.

read the letter

The main thing to know is that this work gives a concrete way to run aggressive KV cache quantization while having a deterministic path back to exact FP16 attention output when the online bounds flag a problem. It stores INT8 keys and INT4 values on GPU with FP16 originals in RAM, computes two per-head per-step bounds on the fly, and uses them for adaptive precision plus a multi-stage fallback ladder. On the PG-19, NIAH, and RULER suites with LLaMA 3.1-8B up to 128K context, it recovers the failures seen in plain quantized baselines and stays within noise of dense FP16 for the tested tasks. That combination of online bounding and guaranteed recovery is the actual addition over standard empirical quantization checks. The experiments also surface a clear trade-off on short-context value-sensitive work that can be dialed back with tighter tolerances. The local per-head per-step scope is stated plainly, which keeps expectations realistic. The soft spot is the error decomposition itself. The total attention error expands to four terms, yet the paper bounds only the attention-distribution shift from key quantization and the value reconstruction error. The cross term from distorted weights times quantized values is not shown to be negligible or separately controlled under the chosen INT8/INT4 settings and online norm estimates. If that term can grow large enough to break the claimed guarantee even when the two bounded terms are satisfied, the runtime certification loses its force. The paper does not appear to include a derivation or empirical check that rules this out. This is aimed at inference engineers who need to compress long contexts without silent quality drops. Readers who care about verifiable approximations rather than average-case robustness will find the mechanism worth examining. It is concrete enough and tied to real benchmarks to deserve a serious referee, even if the bound completeness needs tightening in revision. I would send it out for review.

Referee Report

1 major / 2 minor

Summary. The paper introduces a tiered KV cache for long-context LLM inference that stores INT8-quantized keys and INT4-quantized values in GPU memory while retaining FP16 originals in system RAM. It proposes a two-term error decomposition that produces per-head, per-step runtime bounds on (i) attention-distribution distortion induced by key quantization and (ii) value-reconstruction error; these bounds drive adaptive precision selection and a multi-stage fallback ladder that guarantees exact recovery to the dense FP16 attention output when any bound is violated. Experiments on LLaMA 3.1-8B with contexts up to 128K across PG-19, NIAH, and RULER benchmarks show that the system matches dense FP16 quality within noise on language-modeling and retrieval tasks while recovering catastrophic failures seen in naive INT8/INT4 baselines, with a controllable fidelity-compression trade-off on value-sensitive short-context tasks.

Significance. If the two-term bounds are shown to be valid and tight, the work provides a concrete mechanism for runtime-verified KV quantization that reframes aggressive compression as a certified rather than heuristic approximation. The local (per-head, per-step) certification together with deterministic fallback is a practical strength for safe deployment under strict quality constraints; the empirical recovery of naive-quantization failures on standard long-context benchmarks further supports its utility.

major comments (1)

[Error decomposition and runtime bounds] The central claim rests on a two-term error decomposition that bounds only attention-distribution distortion from key quantization and value-reconstruction error. However, the actual output error is (W + ΔW)·(V + ΔV) = W·V + W·ΔV + ΔW·V + ΔW·ΔV. The manuscript does not appear to derive an explicit bound on the cross term ΔW·ΔV or demonstrate that it is dominated by the other terms under the chosen INT8/INT4 quantization and online norm estimates; without this, the per-head/per-step guarantee can fail even when the individual bounds hold (see abstract and the error-analysis section describing the decomposition).

minor comments (2)

[Abstract] The abstract states that the certification is local and does not guarantee end-to-end model correctness; a brief discussion of how local attention errors propagate (or do not) to downstream metrics would strengthen the claims.
[System architecture] The description of the multi-stage fallback ladder is high-level; adding a concrete pseudocode or diagram of the decision tree (including how value tolerance is set) would improve reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for identifying a potential gap in the error analysis. We address the major comment below and have revised the manuscript to strengthen the presentation of the bounds.

read point-by-point responses

Referee: [Error decomposition and runtime bounds] The central claim rests on a two-term error decomposition that bounds only attention-distribution distortion from key quantization and value-reconstruction error. However, the actual output error is (W + ΔW)·(V + ΔV) = W·V + W·ΔV + ΔW·V + ΔW·ΔV. The manuscript does not appear to derive an explicit bound on the cross term ΔW·ΔV or demonstrate that it is dominated by the other terms under the chosen INT8/INT4 quantization and online norm estimates; without this, the per-head/per-step guarantee can fail even when the individual bounds hold (see abstract and the error-analysis section describing the decomposition).

Authors: We appreciate the referee's precise expansion of the output error and agree that an explicit treatment of the cross term strengthens the analysis. The two-term decomposition in the manuscript separately bounds the attention-weight distortion (ΔW, induced by INT8 key quantization) and the value error (ΔV, from INT4 quantization). To close the gap, we observe that the cross term satisfies ||ΔW · ΔV|| ≤ ||ΔW||_2 · ||ΔV||_2 by the submultiplicative property of the operator norm. Because the per-head bound on ||ΔW|| is already computed online from the key quantization error and the attention-norm estimates, and ||ΔV|| is bounded by the fixed INT4 step size scaled by the value norm, their product is dominated by the sum of the two primary terms under the quantization parameters used (INT8 keys and INT4 values). We have added a short derivation in the error-analysis section that folds this product bound into the total per-head, per-step error guarantee, using the same online norm quantities already maintained by the system. The abstract has been updated for consistency. This revision preserves the original two-term structure while making the total-output guarantee rigorous. revision: yes

Circularity Check

0 steps flagged

No significant circularity in two-term error decomposition or runtime bounds

full rationale

The paper derives per-head per-step bounds via an explicit two-term decomposition of quantization error in the attention output, computed directly from online norm estimates and the current key/value tensors. This decomposition is presented as an analytical separation of attention-distribution distortion and value-reconstruction error, with a deterministic fallback path to exact FP16 that serves as an independent recovery mechanism. No load-bearing step reduces to a fitted parameter, self-citation chain, or definitional tautology; the bounds are not obtained by renaming known empirical patterns or by importing uniqueness results from prior author work. The central certification claim therefore remains self-contained and does not collapse to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of decomposing quantization error into two independent terms that can be bounded online and on the assumption that fallback to FP16 originals is always feasible when bounds are exceeded.

free parameters (1)

value tolerance
Controls the trade-off between compression ratio and fidelity in value-sensitive tasks; mentioned as adjustable in the abstract.

axioms (1)

domain assumption Quantization error in keys and values can be decomposed into two separate, independently boundable terms for attention distribution and value reconstruction.
Invoked as the foundation for deriving per-head per-step bounds and driving the fallback ladder.

pith-pipeline@v0.9.0 · 5822 in / 1380 out tokens · 30545 ms · 2026-05-21T05:42:00.635839+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 4.3 (Key Compression Error) … TV(a,a′) ≤ tanh(Δ)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 7 internal anchors

[1]

Kvquant: Towards 10 million context length llm inference with kv cache quantization

Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. KVQuant: Towards 10 million context length LLM inference with KV cache quantization.NeurIPS, 2024. arXiv:2401.18079

work page arXiv 2024
[2]

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V ., Chen, B., and Hu, X. KIVI: A tuning-free asymmetric 2bit quantization for KV cache.ICML, 2024. arXiv:2402.02750

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

QJL: 1-bit quantized JL transform for KV cache quan- tization with zero overhead.arXiv preprint arXiv:2406.03482, 2024

Zandieh, A., Daliri, M., and Han, I. QJL: 1-bit quantized JL transform for KV cache quan- tization with zero overhead.arXiv preprint arXiv:2406.03482, 2024

work page arXiv 2024
[4]

Hosseini, S. M. T., Ardakani, A., and Gross, W. J. InnerQ: Hardware-aware tuning-free quantization of KV cache for large language models.arXiv preprint arXiv:2602.23200, 2026

work page internal anchor Pith review arXiv 2026
[5]

HACK: Homomorphic acceleration via compression of the key-value cache for disaggregated LLM inference.SIGCOMM, 2025

Zhang, Z., Shen, H., Vargaftik, S., Ben Basat, R., Mitzenmacher, M., and Yu, M. HACK: Homomorphic acceleration via compression of the key-value cache for disaggregated LLM inference.SIGCOMM, 2025

work page 2025
[6]

VecInfer: Efficient LLM inference with low-bit KV cache via outlier-suppressed vector quantization.arXiv preprint arXiv:2510.06175, 2025

Yao, D., Yang, C., Tong, Z., Lin, Z., Liu, W., Luan, J., and Wang, W. VecInfer: Efficient LLM inference with low-bit KV cache via outlier-suppressed vector quantization.arXiv preprint arXiv:2510.06175, 2025

work page arXiv 2025
[7]

FlashInfer: Efficient and customizable attention engine for LLM inference serving.MLSys, 2025

Ye, Z., Chen, L., Lai, R., Lin, W., Zhang, Y., Wang, S., Chen, T., Kasikci, B., Grover, V ., Krishnamurthy, A., and Ceze, L. FlashInfer: Efficient and customizable attention engine for LLM inference serving.MLSys, 2025

work page 2025
[8]

H 2O: Heavy-hitter oracle for efficient generative inference of large language models.NeurIPS,

Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., Wang, Z., and Chen, B. H 2O: Heavy-hitter oracle for efficient generative inference of large language models.NeurIPS,

work page
[9]

Efficient streaming language models with attention sinks.ICLR, 2024

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks.ICLR, 2024

work page 2024
[10]

BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

Yuan, J., Shinn, C., Xu, K., Cui, J., Klimi- ashvili, G., Xiao, G., Zheng, P ., Li, B., Zhou, Y., Ye, Z., You, W., Zheng, T., Brown, D., Wang, P ., Hoehnerbach, M., Cai, R., Demouth, J., Owens, J. D., Hu, X., Han, S., Liu, T., and Mao, H. BLASST: Dynamic blocked atten- tion sparsity via softmax thresholding.arXiv preprint arXiv:2512.12087, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

PSA: Progressive sparse attention for long-context inference.arXiv preprint arXiv:2503.00392, 2025

Zhou, Q., Yin, P ., Zuo, P ., and Cheng, J. PSA: Progressive sparse attention for long-context inference.arXiv preprint arXiv:2503.00392, 2025

work page arXiv 2025
[12]

Twilight: Adaptive attention sparsity with hierarchical top-p pruning.arXiv preprint arXiv:2502.02770,

Lin, C., Tang, J., Yang, S., Wang, H., Tang, T., Tian, B., Stoica, I., Han, S., and Gao, M. Twilight: Adaptive attention sparsity with hierarchical top- p pruning.NeurIPS, 2025. 28 Runtime-Certified Bounded-Error Quantized Attention arXiv:2502.02770

work page arXiv 2025
[13]

Spargeattn: Accurate sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137, 2025

Zhang, J., Xiang, C., Huang, H., Wei, J., Xi, H., Zhu, J., and Chen, J. SpargeAttn: Accurate sparse attention accelerating any model infer- ence.arXiv preprint arXiv:2502.18137, 2025

work page arXiv 2025
[14]

A mathematical the- ory of top-k sparse attention via total varia- tion distance.arXiv preprint arXiv:2512.07647, 2025

Tzachristas, G., Deng, L., Tzachristas, I., Zhang, G., and Chen, R. A mathematical the- ory of top-k sparse attention via total varia- tion distance.arXiv preprint arXiv:2512.07647, 2025

work page arXiv 2025
[15]

vAttention: Dy- namic memory management for serving LLMs without PagedAttention.ASPLOS,

Prabhu, R., Nayak, A., Mohan, J., Ram- jee, R., and Panwar, A. vAttention: Dy- namic memory management for serving LLMs without PagedAttention.ASPLOS,

work page
[16]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with Page- dAttention.SOSP, 2023. arXiv:2309.06180

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Cocktail: Chunk-adaptive mixed-precision quantization for long-context LLM inference

Tao, W., Zhang, B., Qu, X., Wan, J., and Wang, J. Cocktail: Chunk-adaptive mixed-precision quantization for long-context LLM inference. arXiv preprint arXiv:2503.23294, 2025

work page arXiv 2025
[18]

P ., Liu, Z., and Barsoum, E

Joshi, V ., Brahma, P . P ., Liu, Z., and Barsoum, E. TaDA: Training-free recipe for decoding with adaptive KV cache compression and mean-centering.ACL Industry Track, 2025

work page 2025
[19]

Feng, Y., Lv, J., Cao, Y., Xie, X., and Zhou, S. K. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference.arXiv preprint arXiv:2407.11550, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Boroujeni, S. P . H., Mehrabi, N., Woods, P ., Hillesheim, G., and Razi, A. Don’t waste bits! Adaptive KV-cache quantization for lightweight on-device LLMs.arXiv preprint arXiv:2604.04722, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

Qserve: W4a8kv4 quantization and system co-design for efficient llm serving

Lin, Y., Tang, H., Yang, S., Zhang, Z., Xiao, G., Gan, C., and Han, S. QServe: W4A8KV4 quantization and system co- design for efficient LLM serving.MLSys, 2025. arXiv:2405.04532

work page arXiv 2025
[22]

PQ- Cache: Product quantization-based KVCache for long context LLM inference.SIGMOD,

Zhang, H., Ji, X., Chen, Y., Fu, F., Miao, X., Nie, X., Chen, W., and Cui, B. PQ- Cache: Product quantization-based KVCache for long context LLM inference.SIGMOD,

work page
[23]

RULER: What’s the real context size of your long- context language models?COLM, 2024

Hsieh, C.-P ., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., and Ginsburg, B. RULER: What’s the real context size of your long- context language models?COLM, 2024

work page 2024
[24]

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Zandieh, A., Daliri, M., Hadian, M., and Mir- rokni, V . TurboQuant: Online vector quan- tization with near-optimal distortion rate. ICLR, 2026. arXiv:2504.19874

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

JanusQuant: Accurate and efficient 2-bit KV cache quantization for long- context inference.PPoPP, 2026

Sun, C., Xia, Y., Wang, H., Yang, D., Zhou, X., and Cheng, D. JanusQuant: Accurate and efficient 2-bit KV cache quantization for long- context inference.PPoPP, 2026

work page 2026
[26]

PackIn- fer: Compute- and I/O-efficient attention for batched LLM inference.arXiv preprint arXiv:2602.06072, 2026

Ning, R., Zhang, W., and Lai, F. PackIn- fer: Compute- and I/O-efficient attention for batched LLM inference.arXiv preprint arXiv:2602.06072, 2026

work page arXiv 2026
[27]

Certified Quantized At- tention — artifact repository (ver- sion arxiv-v1)

Calver, D. Certified Quantized At- tention — artifact repository (ver- sion arxiv-v1). Zenodo, 2026. doi:10.5281/zenodo.19915933. A Total Variation Bound for Per- turbed Softmax We derive the total variation bound used in the key compression error analysis. The proof proceeds in three steps: bounding individual probability ratios, deriving the full-sequen...

work page doi:10.5281/zenodo.19915933 2026

[1] [1]

Kvquant: Towards 10 million context length llm inference with kv cache quantization

Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. KVQuant: Towards 10 million context length LLM inference with KV cache quantization.NeurIPS, 2024. arXiv:2401.18079

work page arXiv 2024

[2] [2]

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V ., Chen, B., and Hu, X. KIVI: A tuning-free asymmetric 2bit quantization for KV cache.ICML, 2024. arXiv:2402.02750

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

QJL: 1-bit quantized JL transform for KV cache quan- tization with zero overhead.arXiv preprint arXiv:2406.03482, 2024

Zandieh, A., Daliri, M., and Han, I. QJL: 1-bit quantized JL transform for KV cache quan- tization with zero overhead.arXiv preprint arXiv:2406.03482, 2024

work page arXiv 2024

[4] [4]

Hosseini, S. M. T., Ardakani, A., and Gross, W. J. InnerQ: Hardware-aware tuning-free quantization of KV cache for large language models.arXiv preprint arXiv:2602.23200, 2026

work page internal anchor Pith review arXiv 2026

[5] [5]

HACK: Homomorphic acceleration via compression of the key-value cache for disaggregated LLM inference.SIGCOMM, 2025

Zhang, Z., Shen, H., Vargaftik, S., Ben Basat, R., Mitzenmacher, M., and Yu, M. HACK: Homomorphic acceleration via compression of the key-value cache for disaggregated LLM inference.SIGCOMM, 2025

work page 2025

[6] [6]

VecInfer: Efficient LLM inference with low-bit KV cache via outlier-suppressed vector quantization.arXiv preprint arXiv:2510.06175, 2025

Yao, D., Yang, C., Tong, Z., Lin, Z., Liu, W., Luan, J., and Wang, W. VecInfer: Efficient LLM inference with low-bit KV cache via outlier-suppressed vector quantization.arXiv preprint arXiv:2510.06175, 2025

work page arXiv 2025

[7] [7]

FlashInfer: Efficient and customizable attention engine for LLM inference serving.MLSys, 2025

Ye, Z., Chen, L., Lai, R., Lin, W., Zhang, Y., Wang, S., Chen, T., Kasikci, B., Grover, V ., Krishnamurthy, A., and Ceze, L. FlashInfer: Efficient and customizable attention engine for LLM inference serving.MLSys, 2025

work page 2025

[8] [8]

H 2O: Heavy-hitter oracle for efficient generative inference of large language models.NeurIPS,

Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., Wang, Z., and Chen, B. H 2O: Heavy-hitter oracle for efficient generative inference of large language models.NeurIPS,

work page

[9] [9]

Efficient streaming language models with attention sinks.ICLR, 2024

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks.ICLR, 2024

work page 2024

[10] [10]

BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

Yuan, J., Shinn, C., Xu, K., Cui, J., Klimi- ashvili, G., Xiao, G., Zheng, P ., Li, B., Zhou, Y., Ye, Z., You, W., Zheng, T., Brown, D., Wang, P ., Hoehnerbach, M., Cai, R., Demouth, J., Owens, J. D., Hu, X., Han, S., Liu, T., and Mao, H. BLASST: Dynamic blocked atten- tion sparsity via softmax thresholding.arXiv preprint arXiv:2512.12087, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

PSA: Progressive sparse attention for long-context inference.arXiv preprint arXiv:2503.00392, 2025

Zhou, Q., Yin, P ., Zuo, P ., and Cheng, J. PSA: Progressive sparse attention for long-context inference.arXiv preprint arXiv:2503.00392, 2025

work page arXiv 2025

[12] [12]

Twilight: Adaptive attention sparsity with hierarchical top-p pruning.arXiv preprint arXiv:2502.02770,

Lin, C., Tang, J., Yang, S., Wang, H., Tang, T., Tian, B., Stoica, I., Han, S., and Gao, M. Twilight: Adaptive attention sparsity with hierarchical top- p pruning.NeurIPS, 2025. 28 Runtime-Certified Bounded-Error Quantized Attention arXiv:2502.02770

work page arXiv 2025

[13] [13]

Spargeattn: Accurate sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137, 2025

Zhang, J., Xiang, C., Huang, H., Wei, J., Xi, H., Zhu, J., and Chen, J. SpargeAttn: Accurate sparse attention accelerating any model infer- ence.arXiv preprint arXiv:2502.18137, 2025

work page arXiv 2025

[14] [14]

A mathematical the- ory of top-k sparse attention via total varia- tion distance.arXiv preprint arXiv:2512.07647, 2025

Tzachristas, G., Deng, L., Tzachristas, I., Zhang, G., and Chen, R. A mathematical the- ory of top-k sparse attention via total varia- tion distance.arXiv preprint arXiv:2512.07647, 2025

work page arXiv 2025

[15] [15]

vAttention: Dy- namic memory management for serving LLMs without PagedAttention.ASPLOS,

Prabhu, R., Nayak, A., Mohan, J., Ram- jee, R., and Panwar, A. vAttention: Dy- namic memory management for serving LLMs without PagedAttention.ASPLOS,

work page

[16] [16]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with Page- dAttention.SOSP, 2023. arXiv:2309.06180

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

Cocktail: Chunk-adaptive mixed-precision quantization for long-context LLM inference

Tao, W., Zhang, B., Qu, X., Wan, J., and Wang, J. Cocktail: Chunk-adaptive mixed-precision quantization for long-context LLM inference. arXiv preprint arXiv:2503.23294, 2025

work page arXiv 2025

[18] [18]

P ., Liu, Z., and Barsoum, E

Joshi, V ., Brahma, P . P ., Liu, Z., and Barsoum, E. TaDA: Training-free recipe for decoding with adaptive KV cache compression and mean-centering.ACL Industry Track, 2025

work page 2025

[19] [19]

Feng, Y., Lv, J., Cao, Y., Xie, X., and Zhou, S. K. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference.arXiv preprint arXiv:2407.11550, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Boroujeni, S. P . H., Mehrabi, N., Woods, P ., Hillesheim, G., and Razi, A. Don’t waste bits! Adaptive KV-cache quantization for lightweight on-device LLMs.arXiv preprint arXiv:2604.04722, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[21] [21]

Qserve: W4a8kv4 quantization and system co-design for efficient llm serving

Lin, Y., Tang, H., Yang, S., Zhang, Z., Xiao, G., Gan, C., and Han, S. QServe: W4A8KV4 quantization and system co- design for efficient LLM serving.MLSys, 2025. arXiv:2405.04532

work page arXiv 2025

[22] [22]

PQ- Cache: Product quantization-based KVCache for long context LLM inference.SIGMOD,

Zhang, H., Ji, X., Chen, Y., Fu, F., Miao, X., Nie, X., Chen, W., and Cui, B. PQ- Cache: Product quantization-based KVCache for long context LLM inference.SIGMOD,

work page

[23] [23]

RULER: What’s the real context size of your long- context language models?COLM, 2024

Hsieh, C.-P ., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., and Ginsburg, B. RULER: What’s the real context size of your long- context language models?COLM, 2024

work page 2024

[24] [24]

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Zandieh, A., Daliri, M., Hadian, M., and Mir- rokni, V . TurboQuant: Online vector quan- tization with near-optimal distortion rate. ICLR, 2026. arXiv:2504.19874

work page internal anchor Pith review Pith/arXiv arXiv 2026

[25] [25]

JanusQuant: Accurate and efficient 2-bit KV cache quantization for long- context inference.PPoPP, 2026

Sun, C., Xia, Y., Wang, H., Yang, D., Zhou, X., and Cheng, D. JanusQuant: Accurate and efficient 2-bit KV cache quantization for long- context inference.PPoPP, 2026

work page 2026

[26] [26]

PackIn- fer: Compute- and I/O-efficient attention for batched LLM inference.arXiv preprint arXiv:2602.06072, 2026

Ning, R., Zhang, W., and Lai, F. PackIn- fer: Compute- and I/O-efficient attention for batched LLM inference.arXiv preprint arXiv:2602.06072, 2026

work page arXiv 2026

[27] [27]

Certified Quantized At- tention — artifact repository (ver- sion arxiv-v1)

Calver, D. Certified Quantized At- tention — artifact repository (ver- sion arxiv-v1). Zenodo, 2026. doi:10.5281/zenodo.19915933. A Total Variation Bound for Per- turbed Softmax We derive the total variation bound used in the key compression error analysis. The proof proceeds in three steps: bounding individual probability ratios, deriving the full-sequen...

work page doi:10.5281/zenodo.19915933 2026