The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
Pith reviewed 2026-05-10 11:19 UTC · model grok-4.3
The pith
FP16 KV caching produces deterministic token sequence divergence from cache-free recomputation due to differing accumulation orders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KV caching and cache-free execution employ different floating-point accumulation orderings; under FP16 these orderings are non-associative, so the two paths generate divergent token sequences in 100 percent of tested cases across models and sampling methods. The divergence vanishes under FP32, confirming non-associativity as the sole driver, and activation patching experiments localize the causal state to the persistent KV cache rather than transient activations.
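The non-associativity premise is easy to check directly. The snippet below is a minimal sketch (not the paper's code): it sums the same values in two orders, rounding every partial sum to float16 as an FP16 accumulator would, and shows the two orders can round to different results.

```python
# Minimal illustration (not the paper's code): FP16 addition is not associative,
# so the same values summed in different orders can round to different results.
import numpy as np

rng = np.random.default_rng(0)
vals = rng.standard_normal(4096).astype(np.float16)

def fp16_sum(xs):
    """Sum with every partial result rounded back to float16 (emulates an FP16 accumulator)."""
    acc = np.float16(0.0)
    for x in xs:
        acc = np.float16(acc + x)
    return acc

forward = fp16_sum(vals)          # left-to-right accumulation
backward = fp16_sum(vals[::-1])   # same values, reversed order
print(forward, backward, forward == backward)
# On typical draws the two orders disagree in the low-order bits, which is the
# property the paper leans on: a different accumulation order can flip the rounded result.
```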
What carries the argument
The stateful KV cache that reorders the sequence of floating-point accumulations in attention and linear layers relative to full recomputation.
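To see why the cache changes accumulation order, consider the attention output for the newest token: with a cache it is built from previously rounded K/V entries using reductions shaped for a single query, while cache-free recomputation reduces over the whole sequence in one pass. The toy sketch below is my illustration, not the paper's methodology: it contrasts a one-shot float16 reduction with a step-by-step float16 accumulation over the same products.

```python
# Toy contrast (illustrative only): the same weighted sum of value vectors,
# reduced all at once vs. accumulated step by step in float16.
import numpy as np

rng = np.random.default_rng(1)
T, d = 512, 64
probs = rng.random(T).astype(np.float16)             # stand-in for attention weights
probs /= probs.sum()                                 # normalize (rounded to fp16)
V = rng.standard_normal((T, d)).astype(np.float16)   # stand-in for cached values

# Path A: one reduction over the full sequence (cache-free style).
out_full = (probs[:, None] * V).sum(axis=0, dtype=np.float16)

# Path B: running accumulation, one position at a time (cache-style update order).
out_step = np.zeros(d, dtype=np.float16)
for t in range(T):
    out_step = (out_step + probs[t] * V[t]).astype(np.float16)

print(np.max(np.abs(out_full.astype(np.float32) - out_step.astype(np.float32))))
# Any nonzero difference here is purely an artifact of accumulation order:
# both paths compute the same quantity mathematically.
```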
If this is right
- Divergence appears deterministically even under greedy decoding, ruling out sampling as the cause (a decoding-level check is sketched after this list).
- Cache-ON paths produce measurably higher accuracy than cache-OFF in eight of nine tested conditions.
- Models using Grouped-Query Attention exhibit sharp early-layer divergence while sliding-window attention produces uniform drift across layers.
- Activation patching of the full residual stream cannot restore the cache-free token trajectory.
- Raising precision to FP32 reduces numerical drift by eight orders of magnitude and eliminates all token flips.
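A decoding-level check of the first bullet is straightforward to script. The sketch below assumes a HuggingFace-style interface; the model name and prompt are placeholders rather than the paper's exact setup. It compares greedy continuations with and without the KV cache under FP16 and reports the first position at which the token streams disagree.

```python
# Hedged sketch: compare greedy decoding with and without the KV cache under FP16.
# Model name and prompt are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

prompt = "Natalia sold clips to 48 of her friends..."   # placeholder GSM8K-style prompt
inputs = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    cached = model.generate(**inputs, max_new_tokens=128, do_sample=False, use_cache=True)
    uncached = model.generate(**inputs, max_new_tokens=128, do_sample=False, use_cache=False)

prompt_len = inputs.input_ids.shape[1]
cached_new = cached[0, prompt_len:].tolist()
uncached_new = uncached[0, prompt_len:].tolist()

first_flip = next((i for i, (a, b) in enumerate(zip(cached_new, uncached_new)) if a != b), None)
print("first diverging position:", first_flip)
# The paper predicts this is almost never None under FP16, even with greedy decoding.
```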
Where Pith is reading between the lines
- Implementers of serving systems may need precision-aware KV cache updates or compensated summation to restore equivalence (a compensated-summation sketch follows this list).
- Reproducibility audits for LLM inference should include explicit cache versus no-cache comparisons rather than assuming equivalence.
- Similar ordering-induced divergence could appear in other inference optimizations that change computation sequence, such as fused kernels or speculative decoding.
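One concrete form the first point could take is compensated (Kahan) summation for the accumulations that feed or consume the cache. The sketch below is a generic illustration of that technique in float16, not a mechanism described in the paper: it tracks a correction term so the rounded running sum is far less sensitive to the order in which terms arrive.

```python
# Generic Kahan (compensated) summation sketch in float16 -- illustrative only,
# not an API or mitigation proposed by the paper.
import numpy as np

def kahan_sum_fp16(xs):
    """Sum float16 values while carrying a compensation term for lost low-order bits."""
    total = np.float16(0.0)
    comp = np.float16(0.0)                 # running compensation for rounding error
    for x in xs:
        y = np.float16(x - comp)           # apply the carried correction
        t = np.float16(total + y)          # rounded partial sum
        comp = np.float16((t - total) - y) # recover what the rounding dropped
        total = t
    return total

rng = np.random.default_rng(2)
vals = rng.standard_normal(4096).astype(np.float16)

plain = np.float16(0.0)
for v in vals:
    plain = np.float16(plain + v)

print("plain fp16:", plain,
      " kahan fp16:", kahan_sum_fp16(vals),
      " fp64 reference:", vals.astype(np.float64).sum())
# Compensated summation typically lands much closer to the fp64 reference and is
# far less order-sensitive, which is the property equivalence-restoring updates would need.
```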
Load-bearing premise
The divergence is produced solely by FP16 non-associativity of accumulation order and is unaffected by any other implementation differences between the two execution paths.
What would settle it
Re-executing the identical inference paths in FP32 and observing whether the token-flip rate between cache-ON and cache-OFF drops from 100 percent to exactly 0 percent.
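The order-dependence check from the core-claim sketch makes this control concrete: repeat it with a wider accumulator and the gap between accumulation orders shrinks by many orders of magnitude. This is a minimal illustration of the control's structure, assuming nothing beyond NumPy, not a reproduction of the paper's experiment.

```python
# Minimal control sketch (illustrative only): the gap between two accumulation
# orders shrinks dramatically when the accumulator is widened from float16 to
# float32, mirroring the structure of the paper's FP32 falsification.
import numpy as np

rng = np.random.default_rng(0)
vals = rng.standard_normal(4096)

def ordered_sum(xs, dtype):
    acc = dtype(0.0)
    for x in xs:
        acc = dtype(acc + dtype(x))
    return float(acc)

for dtype in (np.float16, np.float32):
    gap = abs(ordered_sum(vals, dtype) - ordered_sum(vals[::-1], dtype))
    print(dtype.__name__, "order-dependence gap:", gap)
# In the paper the analogous contrast is made at the token level: the flip rate
# between cache-ON and cache-OFF drops from 100 percent under FP16 to 0 under FP32.
```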
Original abstract
KV caching is a ubiquitous optimization in autoregressive transformer inference, long presumed to be numerically equivalent to cache-free computation. This assumption fails under standard FP16 precision: cache-ON and cache-OFF execution paths employ different floating-point accumulation orderings which, due to FP16 non-associativity, produce a deterministic divergence in decoded token sequences. Across three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) evaluated on GSM8K, we observe a 100% token divergence rate across all sampling strategies, including greedy decoding, which rules out sampling randomness as a cause, and also with cache-ON yielding higher accuracy in 8 of 9 conditions, where the accuracy difference serves as an indicator that the divergence direction is systematic rather than random. Controlled FP32 falsification reduces divergence by eight orders of magnitude, eliminates token flips, and drops the flip rate to exactly 0.0%, confirming FP16 non-associativity as the sole causal driver. Layer-wise drift profiling reveals architecturally predictable propagation patterns: models using Grouped-Query Attention exhibit sharp divergence at the first layer, while Gemma's larger head dimension and sliding window attention produce uniform accumulation across all layers. Finally, activation patching of the entire residual stream fails to recover the cache-free trajectory, localizing the causal variable to the stateful KV cache. These findings establish that FP16 KV cache inference is fundamentally non-equivalent to recomputation and provide a mechanistic framework for understanding numerical instability in modern LLM inference systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that KV caching, long assumed to be numerically equivalent to cache-free autoregressive inference in transformers, produces deterministic token-sequence divergence under FP16 due to non-associativity of floating-point accumulation. Experiments across LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B on GSM8K show 100% divergence for all sampling strategies (including greedy), with cache-ON often more accurate; FP32 falsification reduces divergence by eight orders of magnitude to 0% token flips; layer-wise drift analysis reveals architecture-specific patterns (sharp early divergence in GQA models vs. uniform in Gemma); and activation patching localizes the causal difference to the stateful KV cache rather than residual activations.
Significance. If the central empirical result holds, the finding is significant for LLM systems research: it demonstrates that a ubiquitous inference optimization is not numerically neutral under standard precision, with systematic effects on output and accuracy. The controlled falsification (FP32), mechanistic localization via patching, and architecture-specific drift profiling supply a concrete framework for diagnosing numerical instability in inference engines, which could inform precision choices, cache designs, and verification practices in production deployments.
minor comments (3)
- [Abstract and §4, results] The claim that cache-ON yields higher accuracy in '8 of 9 conditions' is presented without enumerating the nine conditions or reporting the per-condition accuracy deltas and standard errors; adding this breakdown would make the systematic-direction argument easier to evaluate.
- [Methods] While the FP32 control and activation-patching experiments are described at a high level, the manuscript does not provide pseudocode or sufficient implementation detail on how the cache-ON and cache-OFF forward passes were made identical except for the KV cache (e.g., exact tensor shapes, matmul ordering, and any fused kernels), which is necessary for independent reproduction of the accumulation-order effect.
- [§5, layer-wise drift] The architecturally predictable patterns are summarized only in text; a supplementary figure plotting per-layer cosine drift or norm difference for each model would improve clarity and allow readers to verify the claimed distinction between GQA and sliding-window attention behaviors (one plausible measurement is sketched after this list).
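For readers who want a per-layer picture before any supplementary figure appears, the sketch below is one plausible measurement protocol, not the paper's: decode the final token of a probe sequence twice, once against a KV cache built from the prefix and once in a single full pass, then compare the last position's hidden state at every layer. The model name is a placeholder.

```python
# Hedged sketch of a per-layer drift measurement (one plausible protocol, not the
# paper's): compare the last position's hidden state at every layer when the
# prefix is served from a KV cache vs. recomputed in one pass, under FP16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.3"   # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

ids = tok("A short probe prompt for drift measurement.", return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    # Path A (cache-OFF style): one pass over the whole sequence.
    full = model(ids, output_hidden_states=True, use_cache=False)

    # Path B (cache-ON style): build the cache from the prefix, then run the final token against it.
    prefix = model(ids[:, :-1], use_cache=True)
    step = model(ids[:, -1:], past_key_values=prefix.past_key_values,
                 output_hidden_states=True, use_cache=True)

for layer, (h_full, h_step) in enumerate(zip(full.hidden_states, step.hidden_states)):
    drift = (h_full[:, -1].float() - h_step[:, -1].float()).abs().max().item()
    print(f"layer {layer:2d}  max |delta| = {drift:.3e}")
# Plotting these per-layer values for each architecture is the kind of figure the comment asks for.
```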
Simulated Author's Rebuttal
We thank the referee for their accurate summary of our work and for the positive assessment of its significance to LLM systems research. The controlled falsification, architecture-specific analysis, and localization experiments appear to have been viewed as providing a useful diagnostic framework. We agree with the recommendation for minor revision and will prepare an updated manuscript incorporating any editorial improvements.
Circularity Check
No significant circularity identified
full rationale
The paper presents an empirical study of numerical divergence between KV-cache and cache-free inference paths under FP16, supported by direct side-by-side execution comparisons, FP32 falsification controls that eliminate token flips, layer-wise drift measurements, and activation-patching localization. No equations, fitted parameters, self-citations, or ansatzes appear in the provided text that would reduce any claim to its own inputs by construction. The central non-equivalence result is established through falsifiable experimental contrasts rather than definitional or self-referential steps, making the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Floating-point addition in FP16 is non-associative.
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023.
- [3] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT '21), pages 610–623, New York, NY, USA, 2021. Association for Computing Machinery.
- [4]
- [5] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168, 2021.
- [6] Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv preprint arXiv:2307.08691, 2023.
- [7] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- [8] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.
- [9] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323, 2022.
- [10] David Goldberg. What Every Computer Scientist Should Know About Floating-Point Arithmetic. ACM Computing Surveys (CSUR), 23(1):5–48, 1991.
- [11] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited Numerical Precision. In International Conference on Machine Learning, pages 1737–1746. PMLR, 2015.
- [12] Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. ArXiv, abs/231..., 2023.
- [13] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- [14] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024.
- [15] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
- [16] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed Precision Training. arXiv preprint arXiv:1710.03740, 2017.
- [17] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context Learning and Induction Heads. arXiv preprint arXiv:2209.11895, 2022.
- [18] Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. arXiv preprint arXiv:1911.02150, 2019.
- [19] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving Open Language Models at a Practical Size. arXiv preprint arXiv:2408.00118, 2024.
- [20] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023.
- [21] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. In Advances in Neural Information Processing Systems, volume 33, pages 12388–12401. Curran Associates, Inc., 2020.
- [22] Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. arXiv preprint arXiv:2211.00593, 2022.
- [23] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023.
- [24] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient Streaming Language Models with Attention Sinks. arXiv preprint arXiv:2309.17453, 2023.
- [25] Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, and Zirui Liu. Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [26] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. ArXiv, abs/2309.01219, 2023.
- [27] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.