pith. machine review for the scientific record.

arxiv: 2604.15409 · v1 · submitted 2026-04-16 · 💻 cs.LG · cs.AI

Recognition: unknown

The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

Lei Xu, Ranjith Chodavarapu


Pith reviewed 2026-05-10 11:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords KV cache · FP16 precision · numerical divergence · autoregressive inference · floating-point non-associativity · token sequence · LLM inference · activation patching

The pith

FP16 KV caching produces deterministic token sequence divergence from cache-free recomputation due to differing accumulation orders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the long-held assumption of numerical equivalence between KV-cached and cache-free inference fails under standard FP16 precision. The two execution paths accumulate floating-point sums in different orders, and because low-precision addition is non-associative, the resulting small errors propagate through layers and eventually flip entire output tokens. Experiments on three models with GSM8K show the divergence is deterministic, occurs 100 percent of the time even under greedy decoding, and disappears when precision is raised to FP32. The finding matters for any deployed autoregressive system: cache optimization, previously treated as a safe speedup, systematically alters model outputs in a reproducible way.

Core claim

KV caching and cache-free execution accumulate their floating-point sums in different orders; because FP16 addition is non-associative, the two paths generate divergent token sequences with 100 percent consistency across models and sampling methods. The divergence vanishes under FP32, confirming non-associativity as the sole driver, and activation patching experiments localize the causal state to the persistent KV cache rather than transient activations.
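The driver is elementary and easy to check in isolation. A minimal illustration (ours, not the paper's): FP16 carries an 11-bit significand, so above 2048 the representable values are spaced 2 apart, and the grouping of a sum decides where the rounding lands.

```python
import numpy as np

# Above 2048, FP16 values are 2 apart, so 2049 is not representable
# and rounds back down to 2048 (round-half-to-even).
a, b, c = np.float16(2048.0), np.float16(1.0), np.float16(1.0)

print((a + b) + c)   # 2048.0: each +1 is lost to rounding
print(a + (b + c))   # 2050.0: 1 + 1 = 2 is exact, and 2048 + 2 is representable
```

Matrix-multiply kernels make grouping choices like this billions of times per forward pass, which is why the effect can compound into a token flip rather than averaging out.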

What carries the argument

The stateful KV cache that reorders the sequence of floating-point accumulations in attention and linear layers relative to full recomputation.
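To see why the two paths order their sums differently: with a cache, the newest token's projections are computed as a 1-row matmul against the weights, while full recomputation pushes the whole prefix through one larger matmul, and backends typically select different kernels, and hence different reduction trees, for the two shapes. A hedged PyTorch sketch of the shape-dependent effect; whether the difference is nonzero depends on the backend's kernel choice, and FP16 matmul support on CPU varies by PyTorch version:

```python
import torch

dev = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
d = 4096
W = torch.randn(d, d, dtype=torch.float16, device=dev)
x = torch.randn(8, d, dtype=torch.float16, device=dev)

full_pass = x @ W        # cache-free style: the whole prefix in one matmul
last_only = x[7:8] @ W   # cached style: the newest position as a 1-row matmul

# Same mathematical result, but the two shapes may be tiled differently,
# so in FP16 the shared row need not be bit-identical.
print((full_pass[7:8] - last_only).abs().max().item())
```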

If this is right

  • Divergence appears deterministically even under greedy decoding, ruling out sampling as the cause.
  • Cache-ON paths produce measurably higher accuracy than cache-OFF in eight of nine tested conditions.
  • Models using Grouped-Query Attention exhibit sharp early-layer divergence while sliding-window attention produces uniform drift across layers.
  • Activation patching of the full residual stream cannot restore the cache-free token trajectory.
  • Raising precision to FP32 reduces numerical drift by eight orders of magnitude and eliminates all token flips.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Implementers of serving systems may need precision-aware KV cache updates or compensated summation to restore equivalence (a compensated-summation sketch follows this list).
  • Reproducibility audits for LLM inference should include explicit cache versus no-cache comparisons rather than assuming equivalence.
  • Similar ordering-induced divergence could appear in other inference optimizations that change computation sequence, such as fused kernels or speculative decoding.
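On the first point, a minimal compensated-summation sketch (ours, not from the paper): Kahan summation carries a correction term so that an FP16 accumulator recovers most of the low-order bits an ordinary running sum discards.

```python
import numpy as np

def kahan_sum_fp16(values):
    """Kahan (compensated) summation with every operation kept in FP16."""
    total = np.float16(0.0)
    comp = np.float16(0.0)      # running compensation for lost low-order bits
    for v in values:
        y = v - comp            # re-apply what was lost on the previous step
        t = total + y           # big + small: low-order bits of y can be lost here...
        comp = (t - total) - y  # ...and are recovered here for the next step
        total = t
    return total

x = np.random.default_rng(0).standard_normal(10_000).astype(np.float16)
print(kahan_sum_fp16(x))                        # compensated FP16 sum
print(np.float16(x.astype(np.float32).sum()))   # FP32 reference, rounded to FP16
```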

Load-bearing premise

The divergence is produced solely by FP16 non-associativity acting through the two paths' different accumulation orders, and is unaffected by any other implementation difference between the two execution paths.

What would settle it

Re-executing the identical inference paths in FP32 and observing whether the token-flip rate between cache-ON and cache-OFF drops from 100 percent to exactly 0 percent.
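A hedged sketch of that check using Hugging Face transformers, where use_cache=False forces full-prefix recomputation at every decode step. The model name and prompt are illustrative stand-ins (LLaMA-2-7B is one of the paper's three models; any causal LM works), and this is our reconstruction rather than the authors' harness:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"   # illustrative; one of the paper's models
tok = AutoTokenizer.from_pretrained(name)

def identical_under(dtype):
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=dtype).to("cuda").eval()
    ids = tok("Janet's ducks lay 16 eggs per day.", return_tensors="pt").to("cuda")
    with torch.no_grad():
        on  = model.generate(**ids, max_new_tokens=64, do_sample=False, use_cache=True)
        off = model.generate(**ids, max_new_tokens=64, do_sample=False, use_cache=False)
    return torch.equal(on, off)

print("FP16 identical:", identical_under(torch.float16))  # paper predicts False
print("FP32 identical:", identical_under(torch.float32))  # paper predicts True
```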

Figures

Figures reproduced from arXiv: 2604.15409 by Lei Xu and Ranjith Chodavarapu.

Figure 1. Mean KL divergence between cache-ON and cache-OFF output distributions at each decode step.
Figure 2. Per-layer hidden state drift magnitude (L2 norm of the difference vector between cache-ON and cache-OFF hidden states).
Figure 3. FP16 versus FP32 KL divergence across all three models.
Figure 4. Decision boundary analysis: four views of the flip index–KL relationship.
Figure 5. Activation patching results for all three models. Each point shows the percentage KL divergence […].
Original abstract

KV caching is a ubiquitous optimization in autoregressive transformer inference, long presumed to be numerically equivalent to cache-free computation. This assumption fails under standard FP16 precision: cache-ON and cache-OFF execution paths employ different floating-point accumulation orderings which, due to FP16 non-associativity, produce a deterministic divergence in decoded token sequences. Across three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) evaluated on GSM8K, we observe a 100% token divergence rate across all sampling strategies, including greedy decoding, which rules out sampling randomness as a cause, and also with cache-ON yielding higher accuracy in 8 of 9 conditions, where the accuracy difference serves as an indicator that the divergence direction is systematic rather than random. Controlled FP32 falsification reduces divergence by eight orders of magnitude, eliminates token flips, and drops the flip rate to exactly 0.0%, confirming FP16 non-associativity as the sole causal driver. Layer-wise drift profiling reveals architecturally predictable propagation patterns: models using Grouped-Query Attention exhibit sharp divergence at the first layer, while Gemma's larger head dimension and sliding window attention produce uniform accumulation across all layers. Finally, activation patching of the entire residual stream fails to recover the cache-free trajectory, localizing the causal variable to the stateful KV cache. These findings establish that FP16 KV cache inference is fundamentally non-equivalent to recomputation and provide a mechanistic framework for understanding numerical instability in modern LLM inference systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that KV caching, long assumed to be numerically equivalent to cache-free autoregressive inference in transformers, produces deterministic token-sequence divergence under FP16 due to non-associativity of floating-point accumulation. Experiments across LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B on GSM8K show 100% divergence for all sampling strategies (including greedy), with cache-ON often more accurate; FP32 falsification reduces divergence by eight orders of magnitude to 0% token flips; layer-wise drift analysis reveals architecture-specific patterns (sharp early divergence in GQA models vs. uniform in Gemma); and activation patching localizes the causal difference to the stateful KV cache rather than residual activations.

Significance. If the central empirical result holds, the finding is significant for LLM systems research: it demonstrates that a ubiquitous inference optimization is not numerically neutral under standard precision, with systematic effects on output and accuracy. The controlled falsification (FP32), mechanistic localization via patching, and architecture-specific drift profiling supply a concrete framework for diagnosing numerical instability in inference engines, which could inform precision choices, cache designs, and verification practices in production deployments.

minor comments (3)
  1. [Abstract, §4] The claim that cache-ON yields higher accuracy in "8 of 9 conditions" is presented without enumerating the nine conditions or reporting per-condition accuracy deltas and standard errors; adding this breakdown would make the systematic-direction argument easier to evaluate.
  2. [Methods] The FP32 control and activation-patching experiments are described at a high level, but the manuscript does not provide pseudocode or sufficient implementation detail on how the cache-ON and cache-OFF forward passes were made identical except for the KV cache (e.g., exact tensor shapes, matmul ordering, and any fused kernels), which independent reproduction of the accumulation-order effect requires; a sketch of the intended contrast follows this list.
  3. [§5] The architecturally predictable layer-wise drift patterns are summarized only in text; a supplementary figure plotting per-layer drift (cosine or norm difference) for each model would improve clarity and let readers verify the claimed distinction between GQA and sliding-window attention behaviors.
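On the second point, the intended contrast can be pinned down in a few lines against the standard Hugging Face forward interface; a minimal sketch (our reconstruction, not the paper's code) in which the only difference between the two paths is whether a cache is threaded through:

```python
import torch

@torch.no_grad()
def greedy_decode(model, ids, steps, use_cache):
    """Greedy decoding; use_cache=False recomputes the full prefix every step."""
    past = None
    for _ in range(steps):
        if use_cache and past is not None:
            out = model(input_ids=ids[:, -1:], past_key_values=past, use_cache=True)
        else:
            out = model(input_ids=ids, use_cache=use_cache)
        past = out.past_key_values if use_cache else None
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```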

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate summary of our work and for the positive assessment of its significance to LLM systems research. The controlled falsification, architecture-specific analysis, and localization experiments appear to have been viewed as providing a useful diagnostic framework. We agree with the recommendation for minor revision and will prepare an updated manuscript incorporating any editorial improvements.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an empirical study of numerical divergence between KV-cache and cache-free inference paths under FP16, supported by direct side-by-side execution comparisons, FP32 falsification controls that eliminate token flips, layer-wise drift measurements, and activation-patching localization. The provided text contains no equations, fitted parameters, self-citations, or ansatzes that would reduce any claim to its own inputs by construction. The central non-equivalence result is established through falsifiable experimental contrasts rather than definitional or self-referential steps, making the chain of evidence self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the standard mathematical property of FP16 non-associativity and the empirical observation that cache and no-cache paths differ in accumulation order; no free parameters or invented entities are introduced.

axioms (1)
  • [standard math] Floating-point addition in FP16 is non-associative; invoked to explain why different accumulation orders produce different results.

pith-pipeline@v0.9.0 · 5590 in / 1252 out tokens · 30025 ms · 2026-05-10T11:19:54.722224+00:00 · methodology

