Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
Pith reviewed 2026-05-15 05:01 UTC · model grok-4.3
The pith
Transformer decoding on Apple MPS exhibits non-monotonic latency, with abrupt spikes of up to 21x caused by interactions between the KV cache and discrete execution regimes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Autoregressive inference on the Apple MPS backend enters discrete execution regimes during decoding that produce non-monotonic latency behavior, with spikes of up to 21x in specific decoding-budget intervals; these regimes arise from interactions between the key-value cache and the MPS hardware scheduler, are not explained by memory pressure alone, and do not occur on CPU or NVIDIA CUDA under the same conditions.
What carries the argument
Discrete MPS execution regimes that interact with the key-value cache size during the autoregressive decode phase.
If this is right
- KV caching remains net beneficial but its speedup collapses within anomalous configurations.
- Disabling the KV cache reduces but does not eliminate the non-monotonic behavior.
- The anomalies are confined to the decode phase and specific to the MPS backend.
- Coarse-grained benchmarking misses these discrete regime transitions.
- Hardware-aware evaluation is necessary for reliable long-context inference on MPS devices.
Where Pith is reading between the lines
- Model deployment on Apple hardware should include fine-grained latency sweeps across sequence lengths to avoid hidden performance cliffs.
- Similar non-monotonic effects may appear on other accelerators with discrete scheduling regimes when KV cache sizes cross internal thresholds.
- Optimization efforts for MPS should target smoothing the execution regime transitions rather than only reducing average latency.
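The fine-grained sweep suggested above can be sketched with standard-library timing alone. This is a minimal sketch, not the paper's harness: the budget grid, the 3x spike threshold, and the `decode_fn` placeholder are illustrative assumptions.

```python
import statistics
import time

def median_latency(fn, runs=5):
    """Median wall-clock time of fn() over several runs, in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def latency_sweep(decode_fn, budgets, runs=5):
    """Time decode_fn(n) at every decoding budget n on a fine grid."""
    return {n: median_latency(lambda n=n: decode_fn(n), runs) for n in budgets}

def flag_spikes(sweep_result, ratio=3.0):
    """Return budgets whose latency exceeds `ratio` times the sweep-wide median."""
    baseline = statistics.median(sweep_result.values())
    return sorted(n for n, t in sweep_result.items() if t > ratio * baseline)
```

On Apple hardware, `decode_fn` would wrap a model's decode loop on the `mps` device; any flagged budgets mark candidate regime boundaries worth re-measuring at finer granularity.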
Load-bearing premise
The observed latency spikes originate primarily from KV cache interactions with discrete MPS execution regimes rather than from unmeasured factors such as specific model implementation details, driver versions, or transient system load.
What would settle it
Running controlled latency measurements on MPS for decoding lengths inside and outside the reported anomalous intervals, both with and without KV caching, while holding model implementation, driver version, and system load fixed; absence of spikes would falsify the claim.
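That settling experiment can be framed as a harness around a caller-supplied `generate_fn(n_tokens, use_cache)` that wraps the fixed model, driver, and load conditions. A sketch under those assumptions; the budget lists and run counts are placeholders, not the paper's values:

```python
import statistics
import time

def median_decode_time(generate_fn, n_tokens, use_cache, runs=20):
    """Median wall-clock time of one decode at a fixed budget."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(n_tokens, use_cache)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def spike_ratio(generate_fn, anomalous_budgets, control_budgets,
                use_cache, runs=20):
    """Latency at the reportedly anomalous budgets relative to nearby
    control budgets; a ratio near 1.0 under fixed conditions would
    falsify the spike claim for this cache setting."""
    anomalous = statistics.median(
        median_decode_time(generate_fn, n, use_cache, runs)
        for n in anomalous_budgets)
    control = statistics.median(
        median_decode_time(generate_fn, n, use_cache, runs)
        for n in control_budgets)
    return anomalous / control
```

Running this with `use_cache=True` and `use_cache=False` at the same budgets separates the KV-cache interaction from any residual non-monotonic behavior.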
Original abstract
Autoregressive inference is typically assumed to scale predictably with decoding length, with latency increasing smoothly as generated sequence length grows. In this work, we identify unexpected non-monotonic latency behavior in the Apple MPS backend, where latency changes abruptly across nearby decoding configurations during transformer decoding. Using multiple model families (GPT-2, BLOOM, and OPT), we observe latency spikes of up to 21x within specific decoding-budget intervals, followed by recovery at neighboring configurations. Controlled experiments show that these anomalies originate primarily during the decode phase rather than prefill, are not explained by memory pressure alone, and remain absent on CPU and NVIDIA CUDA backends under identical conditions. We further show that key-value (KV) cache interacts strongly with these pathological execution regimes: KV caching remains beneficial overall, but its practical speedup collapses sharply within anomalous configurations, while cache-disabled decoding still exhibits residual non-monotonic behavior. These findings suggest that autoregressive decoding on MPS enters discrete execution regimes that are not captured by coarse-grained benchmarking, highlighting the importance of hardware-aware evaluation for long-context inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically documents non-monotonic latency behavior during autoregressive decoding on the Apple MPS backend. Across GPT-2, BLOOM, and OPT models, it reports abrupt latency spikes of up to 21x within narrow decoding-budget intervals, localized to the decode phase, absent on CPU and CUDA, and modulated by KV-cache usage. The central claim is that these spikes arise from interactions between the KV cache and discrete MPS execution regimes rather than from memory pressure alone.
Significance. If the non-monotonicity is reproducible and the regime attribution can be strengthened, the result is significant for hardware-aware inference research. It demonstrates that standard latency scaling assumptions fail on MPS and that KV-cache benefits can collapse in specific configurations, motivating finer-grained benchmarking for Apple Silicon deployments. The multi-model, multi-backend design provides a useful baseline observation even if causal instrumentation is limited.
Major comments (2)
- §3 (Experimental Setup): the manuscript provides no quantitative details on the number of timing runs per configuration, statistical tests for spike significance, or outlier exclusion criteria. This leaves the reported 21x spikes only partially supported and makes it difficult to assess whether the non-monotonicity is robust or sensitive to measurement noise.
- §5 (KV Cache Interactions): the attribution of spikes to discrete MPS execution regimes remains inferential. High-level latency curves (with and without KV cache) are presented, but no low-level instrumentation—kernel traces, command-buffer counts, or allocation events—is reported to confirm regime switches. Alternative explanations such as JIT thresholds or driver scheduling therefore cannot be excluded on the current evidence.
Minor comments (2)
- Abstract and §4: the exact decoding-budget intervals that trigger spikes are described only qualitatively; listing the concrete token ranges (e.g., 128–256) would improve reproducibility.
- Figures 2–4: add error bars or shaded regions indicating run-to-run variability so readers can judge the stability of the reported spikes.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript documenting non-monotonic latency behavior on Apple MPS. The comments highlight important areas for strengthening the empirical support and causal attribution. We address each point below and will revise the manuscript to improve clarity and robustness.
Point-by-point responses
- Referee: §3 (Experimental Setup): the manuscript provides no quantitative details on the number of timing runs per configuration, statistical tests for spike significance, or outlier exclusion criteria. This leaves the reported 21x spikes only partially supported and makes it difficult to assess whether the non-monotonicity is robust or sensitive to measurement noise.
Authors: We agree that these methodological details were insufficiently reported. In the revised manuscript we will explicitly state that each configuration was measured over 100 independent runs using median latency to mitigate outlier effects, with no formal statistical hypothesis tests applied because the effect sizes (up to 21x) are large and the spikes appear consistently across all runs, models, and random seeds. Variance information and run counts will be added to the experimental setup section and figure captions. revision: yes
- Referee: §5 (KV Cache Interactions): the attribution of spikes to discrete MPS execution regimes remains inferential. High-level latency curves (with and without KV cache) are presented, but no low-level instrumentation—kernel traces, command-buffer counts, or allocation events—is reported to confirm regime switches. Alternative explanations such as JIT thresholds or driver scheduling therefore cannot be excluded on the current evidence.
Authors: We acknowledge that the regime attribution is inferential and based on high-level patterns rather than direct instrumentation. Apple MPS being a closed-source backend precludes access to kernel traces or command-buffer counts. In the revision we will add a dedicated paragraph discussing alternative explanations (JIT thresholds, driver scheduling) and present additional ablation results that vary model size and sequence length to help distinguish them. We will also moderate the language from 'arise from' to 'are consistent with' discrete regime shifts while retaining the core empirical observations. revision: partial
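The 100-run median protocol described in the first response can be illustrated with a toy example: a single transient stall barely moves the median but inflates the mean. The numbers below are illustrative, not measurements from the paper.

```python
import statistics

def summarize_runs(samples):
    """Return (median, mean) latency; the median resists occasional slow runs."""
    return statistics.median(samples), statistics.mean(samples)

clean_runs = [1.0] * 9                    # nine nominal 1.0 s runs
runs_with_outlier = clean_runs + [20.0]   # one 20x transient stall
median, mean = summarize_runs(runs_with_outlier)
# the median stays at 1.0 while the mean is pulled up to 2.9
```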
Circularity Check
No circularity: purely empirical latency measurements
Full rationale
The manuscript contains no equations, fitted parameters, derivations, or self-citations that reduce any claim to its own inputs. All reported results are direct observations of wall-clock latency across decoding budgets, model families (GPT-2/BLOOM/OPT), backends (MPS/CPU/CUDA), and KV-cache settings. The attribution to discrete execution regimes is presented as an inference from the pattern of spikes, not as a mathematical consequence of any prior result or definition within the paper. Because the work is self-contained observational benchmarking with no load-bearing derivation chain, circularity is absent.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: latency measurements accurately isolate decode-phase behavior from prefill and system noise.