Recognition: unknown
Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
Pith reviewed 2026-05-10 04:56 UTC · model grok-4.3
The pith
A six-tier KV cache with Bayesian reuse prediction is projected to deliver 1.4-2.1x lower TTFT and 47% cost savings in large-scale LLM inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that an architecture-variant-aware sizing engine, a six-tier memory hierarchy extending from GPU HBM to parallel filesystems, and a Bayesian reuse predictor using Beta conjugate priors over 16 block-transition pairs, together with EMA-scored head-granular eviction and RoPE-aware prefetching, achieve 70-84% cache hit rates on replayed traces, and that analytical projections built on those hit rates yield 1.4-2.1x TTFT reduction, 1.7-2.9x throughput improvement, and 47% cost reduction versus state-of-the-art baselines.
What carries the argument
The Bayesian reuse predictor with Beta conjugate priors over 16 (block-type, transition-type) pairs, paired with EMA-scored head-granular eviction and RoPE-aware prefetching, that decides which KV cache blocks to keep or fetch across tiers.
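The paper does not include pseudocode for the predictor. The sketch below shows how Beta conjugate priors over (block-type, transition-type) pairs could drive reuse decisions; the 4x4 category names, the uniform Beta(1, 1) prior, and the record/reuse_probability interface are illustrative assumptions, not the paper's implementation.

```python
class BetaReusePredictor:
    """Per-(block-type, transition-type) Beta posterior over reuse probability.

    Sketch only: the paper states 16 (block-type, transition-type) pairs but
    does not publish the category definitions or prior values used here.
    """

    def __init__(self, block_types, transition_types, alpha0=1.0, beta0=1.0):
        # One Beta(alpha, beta) posterior per (block_type, transition_type) pair.
        self.posteriors = {
            (b, t): [alpha0, beta0]
            for b in block_types
            for t in transition_types
        }

    def record(self, block_type, transition_type, reused):
        """Conjugate update: a reuse observation bumps alpha, a non-reuse bumps beta."""
        posterior = self.posteriors[(block_type, transition_type)]
        if reused:
            posterior[0] += 1.0
        else:
            posterior[1] += 1.0

    def reuse_probability(self, block_type, transition_type):
        """Posterior mean of the Beta distribution: alpha / (alpha + beta)."""
        a, b = self.posteriors[(block_type, transition_type)]
        return a / (a + b)


# Hypothetical category names; the paper does not enumerate its 16 pairs.
BLOCK_TYPES = ["system_prompt", "user_turn", "assistant_turn", "tool_output"]
TRANSITION_TYPES = ["same_session", "new_turn", "agent_step", "cold"]

predictor = BetaReusePredictor(BLOCK_TYPES, TRANSITION_TYPES)
predictor.record("system_prompt", "new_turn", reused=True)
print(predictor.reuse_probability("system_prompt", "new_turn"))  # ~0.67 after one hit
```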
If this is right
- Exact KV cache sizing for unsupported attention types such as multi-head latent attention removes up to 57x over-provisioning and supports up to 7.4x larger batch sizes.
- The six-tier hierarchy increases effective KV cache capacity from 40 GB to over 38 TB per node while preserving sub-millisecond TTFT for hot entries (a back-of-envelope tally follows this list).
- 70-84% hit rates from the Bayesian predictor and EMA eviction reduce recomputation and enable the projected throughput and cost gains.
- Component validation on ShareGPT, LMSYS-Chat-1M, and agentic traces confirms the hit rates that underpin the analytical performance projections.
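A back-of-envelope tally of the capacity and latency claims above, where the per-tier capacities, latencies, and hit split are placeholder values chosen only to show the arithmetic, not figures reported by the paper:

```python
# (tier name, capacity in GB, access latency in microseconds) -- placeholder values only.
TIERS = [
    ("GPU HBM",             40,     1),      # hot entries; sub-millisecond path
    ("CPU DRAM",            512,    5),
    ("CXL-attached memory", 2_048,  10),
    ("NVMe via GPUDirect",  8_192,  100),
    ("RDMA fabric pool",    8_192,  50),
    ("Parallel filesystem", 20_000, 1_000),
]

total_gb = sum(cap for _, cap, _ in TIERS)
print(f"effective KV capacity: {total_gb / 1024:.1f} TB")    # ~38 TB with these placeholders

# Expected fetch latency for a hypothetical distribution of hits across tiers.
hit_split = [0.70, 0.15, 0.08, 0.04, 0.02, 0.01]
expected_us = sum(p * lat for p, (_, _, lat) in zip(hit_split, TIERS))
print(f"expected KV fetch latency: {expected_us:.1f} us")    # hot-dominated, well under 1 ms
```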
Where Pith is reading between the lines
- If hit rates remain high on live traffic, the same number of GPUs could support substantially more concurrent users without added hardware.
- The sizing engine could be reused as a standalone tool to right-size KV caches in existing single-tier inference frameworks.
- RoPE-aware prefetching logic might extend naturally to other positional encodings used in newer model families.
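The paper does not spell out the prefetching logic. One plausible reading of "RoPE-aware" is that a cached key block reused at a shifted position only needs an extra rotation by the position delta, because rotary embeddings compose additively; the sketch below assumes exactly that and is not taken from the paper.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply interleaved rotary position embedding to the last dim of x (even head_dim)."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)          # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]        # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def reposition_cached_keys(cached_keys, old_start, new_start):
    """Re-rotate a cached key block from its old position offset to a new one.

    Because RoPE rotations compose additively in the angle, shifting a block by
    delta positions needs only one extra rotation by delta; the underlying key
    projections are not recomputed. (Assumed behavior; not taken from the paper.)
    """
    seq_len = cached_keys.shape[0]
    delta = np.full(seq_len, new_start - old_start, dtype=np.float64)
    return rope_rotate(cached_keys, delta)

# Usage: a 16-token key block cached at offset 0, reused at offset 128.
keys = rope_rotate(np.random.randn(16, 64), np.arange(16))
repositioned = reposition_cached_keys(keys, old_start=0, new_start=128)
```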
Load-bearing premise
The Bayesian reuse predictor will continue to deliver 70-84% hit rates and sub-millisecond TTFT for hot entries once the full six-tier hardware and real production workloads are in place.
What would settle it
Running the complete system on hardware that includes all six memory tiers and measuring hit rates, TTFT, and throughput on production traces from ShareGPT or LMSYS-Chat-1M to check whether they match the 70-84% and 1.4-2.1x projections.
Original abstract
Key-value (KV) cache memory management is the primary bottleneck limiting throughput and cost-efficiency in large-scale GPU inference serving. Current systems suffer from three compounding inefficiencies: (1) the absence of unified KV cache sizing across all attention architectures--particularly multi-head latent attention (MLA), which is unsupported in general-purpose frameworks, resulting in up to 57x memory over-provisioning; (2) confinement of KV cache to a single memory tier (GPU HBM) despite the availability of a rich hierarchy spanning CPU DRAM, CXL-attached memory, NVMe via GPUDirect Storage, RDMA fabric, and parallel filesystems; and (3) reactive eviction policies that discard reusable state, forcing redundant recomputation. We present a unified system that addresses all three problems. Our architecture-variant-aware sizing engine computes exact memory requirements per attention type, enabling up to 7.4x higher batch sizes. A six-tier memory hierarchy extends effective KV cache capacity from 40 GB to over 38 TB per node while maintaining sub-millisecond time-to-first-token (TTFT) for hot entries. A Bayesian reuse predictor with Beta conjugate priors over 16 (block-type, transition-type) pairs achieves 70-84% cache hit rates, combined with EMA-scored head-granular eviction and RoPE-aware prefetching. Component-level validation on trace replay using ShareGPT, LMSYS-Chat-1M, and agentic workloads demonstrates 70-84% cache hit rates. Analytical projections combining validated component behavior with published hardware specifications indicate 1.4-2.1x projected TTFT reduction, 1.7-2.9x throughput improvement, and 47% cost reduction compared to state-of-the-art baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a unified KV cache management system for large-scale GPU inference serving. It introduces an architecture-variant-aware sizing engine to compute exact memory needs across attention types (including unsupported MLA), a six-tier memory hierarchy extending effective capacity from 40 GB to over 38 TB per node, and a Bayesian reuse predictor using Beta conjugate priors over 16 (block-type, transition-type) pairs combined with EMA-scored head-granular eviction and RoPE-aware prefetching. Component-level trace replay on ShareGPT, LMSYS-Chat-1M, and agentic workloads validates 70-84% hit rates, with analytical projections indicating 1.4-2.1x TTFT reduction, 1.7-2.9x throughput gains, and 47% cost reduction versus baselines.
Significance. If the projections are borne out, the work could meaningfully advance cost-efficient inference by exploiting cheaper memory tiers and predictive reuse to support larger batches and reduce recomputation. The component-level trace-replay validation of the Bayesian predictor and EMA eviction provides a concrete foundation for the hit-rate claims.
major comments (3)
- Abstract: The central performance claims (1.4-2.1x TTFT reduction, 1.7-2.9x throughput, 47% cost reduction) are obtained solely by analytical combination of 70-84% hit rates measured in separate trace-replay experiments with published per-tier bandwidth/latency numbers; no end-to-end measurements on integrated GPU+CPU+CXL+NVMe hardware are reported.
- Abstract: The projection implicitly assumes that RoPE-aware prefetching and cross-tier movements add zero latency beyond the individual tier specifications and that the 70-84% hit rates remain unchanged when lower tiers are populated under realistic interleaved request patterns; this assumption is load-bearing for the claimed gains but untested in a full-system setting.
- Abstract: The Bayesian predictor relies on fitted Beta priors for the 16 pairs and EMA decay factors tuned to the evaluated traces; the manuscript should demonstrate robustness of these parameters and hit-rate stability when the multi-tier hierarchy is actually exercised rather than projected.
minor comments (1)
- Abstract: The claim of 'up to 57x memory over-provisioning' for MLA would benefit from an explicit calculation or reference to the sizing mismatch in current frameworks.
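One plausible reconstruction of the 57x figure uses DeepSeek-V2-style shapes taken from the public DeepSeek-V2 report [10] (60 layers, 128 heads of dimension 128, a 512-dimensional compressed KV latent plus a 64-dimensional decoupled RoPE key). The calculation below is illustrative and not taken from the paper under review:

```python
def kv_bytes_per_token(num_layers, kv_heads, head_dim, dtype_bytes=2):
    """Standard MHA/GQA/MQA KV cache: two tensors (K and V) per layer per token."""
    return num_layers * 2 * kv_heads * head_dim * dtype_bytes

def mla_bytes_per_token(num_layers, d_latent, d_rope, dtype_bytes=2):
    """MLA caches a compressed KV latent plus a small decoupled RoPE key per layer."""
    return num_layers * (d_latent + d_rope) * dtype_bytes

# Illustrative DeepSeek-V2-like shapes; see the model's technical report for exact values.
layers, heads, head_dim = 60, 128, 128
d_latent, d_rope = 512, 64

naive = kv_bytes_per_token(layers, heads, head_dim)       # sized as if full MHA
exact = mla_bytes_per_token(layers, d_latent, d_rope)     # sized for MLA
print(f"over-provisioning factor: {naive / exact:.1f}x")  # ~56.9x with these shapes
```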
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential of our unified KV cache system to improve cost-efficiency in large-scale inference. We address each major comment point by point below, with clarifications on our validation approach and commitments to revisions that strengthen the presentation without overstating the current results.
Point-by-point responses
-
Referee: Abstract: The central performance claims (1.4-2.1x TTFT reduction, 1.7-2.9x throughput, 47% cost reduction) are obtained solely by analytical combination of 70-84% hit rates measured in separate trace-replay experiments with published per-tier bandwidth/latency numbers; no end-to-end measurements on integrated GPU+CPU+CXL+NVMe hardware are reported.
Authors: We agree that the performance numbers are analytical projections that combine component-level hit rates (obtained via trace replay on the three workloads) with published per-tier bandwidth and latency specifications. This methodology follows common practice in systems research when a complete integrated testbed spanning all six tiers is not yet widely available. The trace-replay experiments already exercise the Bayesian predictor, EMA eviction, and RoPE-aware prefetching under realistic access patterns. In revision we will add an explicit Limitations subsection that details the analytical model, its conservative latency assumptions, and our plans for future end-to-end evaluation on emerging CXL/GPUDirect hardware. revision: partial
-
Referee: Abstract: The projection implicitly assumes that RoPE-aware prefetching and cross-tier movements add zero latency beyond the individual tier specifications and that the 70-84% hit rates remain unchanged when lower tiers are populated under realistic interleaved request patterns; this assumption is load-bearing for the claimed gains but untested in a full-system setting.
Authors: The trace-replay workloads (ShareGPT, LMSYS-Chat-1M, agentic) already contain interleaved request streams that populate and exercise the multi-tier hierarchy. Our analytical model uses published latency figures for each tier and does not assume zero additional latency for prefetching or movements; rather, it folds those costs into the per-tier numbers. We acknowledge that a full-system run would provide the strongest confirmation. We will revise the Evaluation and Abstract sections to state these assumptions more explicitly and will include a sensitivity study that varies prefetching and movement latencies to show how the projected gains degrade under more pessimistic assumptions. revision: partial
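Such a sensitivity study might take the following shape; the prefix-reuse model, prefill and fetch latencies, and cached fraction below are placeholder assumptions for illustration, not the authors' analytical model:

```python
def projected_ttft_us(prefill_us, hit_rate, cached_fraction,
                      fetch_us, movement_overhead_us=0.0):
    """Expected TTFT under a simple prefix-reuse model (illustrative only).

    A cache hit skips recomputation of the cached prefix (cached_fraction of
    the prompt) but pays a cross-tier fetch plus any extra movement overhead.
    """
    saved = hit_rate * cached_fraction * prefill_us
    fetch = hit_rate * (fetch_us + movement_overhead_us)
    return prefill_us - saved + fetch

baseline = projected_ttft_us(prefill_us=400_000, hit_rate=0.0,
                             cached_fraction=0.0, fetch_us=0.0)
for overhead in (0, 1_000, 5_000, 20_000):               # increasingly pessimistic movement costs
    ours = projected_ttft_us(prefill_us=400_000, hit_rate=0.77,
                             cached_fraction=0.6, fetch_us=800,
                             movement_overhead_us=overhead)
    print(f"movement overhead {overhead:>6} us -> projected speedup {baseline / ours:.2f}x")
```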
-
Referee: Abstract: The Bayesian predictor relies on fitted Beta priors for the 16 pairs and EMA decay factors tuned to the evaluated traces; the manuscript should demonstrate robustness of these parameters and hit-rate stability when the multi-tier hierarchy is actually exercised rather than projected.
Authors: The Beta conjugate priors and EMA factors were fitted on the same trace data used for validation, and the trace-replay already runs the full predictor across tier transitions. To strengthen the claim, we will add a dedicated robustness subsection (with accompanying figures) that reports hit-rate variation under perturbations of the prior parameters and EMA decay constants, as well as when the number of active tiers is varied. This will be placed in the main evaluation section rather than the appendix. revision: yes
Circularity Check
No significant circularity; projections combine measured component results with external hardware data.
Full rationale
The paper validates the Bayesian reuse predictor's 70-84% hit rates via separate trace-replay experiments on ShareGPT and similar workloads, then analytically combines those empirical hit rates with published hardware bandwidth/latency specifications to project TTFT, throughput, and cost gains. No equation or step equates the projected speedups to the fitted Beta priors or EMA parameters by construction; the hit-rate numbers are outputs of validation, not inputs that are renamed as predictions. No self-citations, uniqueness theorems, or ansatzes are invoked to force the central claims. The projections therefore rest on measured component behavior and external benchmarks rather than on circular reasoning.
Axiom & Free-Parameter Ledger
free parameters (2)
- Beta conjugate priors for 16 (block-type, transition-type) pairs
- EMA decay factors for head-granular eviction scoring (sketched after this ledger)
axioms (2)
- Domain assumption: Inference workload memory access patterns exhibit predictable reuse that can be captured by 16 block-transition categories.
- Domain assumption: Hot KV cache entries can maintain sub-millisecond TTFT when served from non-GPU tiers with appropriate prefetching.
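The EMA decay factor listed among the free parameters governs how quickly per-head usage scores forget past activity. A minimal sketch of head-granular eviction scoring, assuming recent attention mass as the usage signal (the paper does not publish its exact formulation):

```python
class EmaHeadEvictionScorer:
    """Exponential-moving-average scores per (layer, head) for eviction ordering.

    Sketch only: the paper scores at head granularity but does not publish the
    signal being averaged; recent attention mass to cached blocks is assumed here.
    """

    def __init__(self, num_layers, num_heads, decay=0.95):
        self.decay = decay
        self.scores = [[0.0] * num_heads for _ in range(num_layers)]

    def update(self, layer, head, usage):
        """Blend the latest usage signal into the running EMA score."""
        prev = self.scores[layer][head]
        self.scores[layer][head] = self.decay * prev + (1.0 - self.decay) * usage

    def eviction_candidates(self, k):
        """Return the k (layer, head) pairs with the lowest EMA scores."""
        flat = [(score, layer, head)
                for layer, row in enumerate(self.scores)
                for head, score in enumerate(row)]
        flat.sort()
        return [(layer, head) for _, layer, head in flat[:k]]


scorer = EmaHeadEvictionScorer(num_layers=32, num_heads=8, decay=0.95)
scorer.update(layer=0, head=3, usage=1.0)      # head 3 of layer 0 just attended strongly
print(scorer.eviction_candidates(k=4))          # coldest heads are evicted first
```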
Reference graph
Works this paper leans on
-
[1]
Efficient Memory Management for Large Language Model Serving with PagedAttention,
W. Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention,” in Proc. SOSP, pp. 611–626, 2023
2023
-
[2]
SGLang: Efficient Execution of Structured Language Model Programs,
L. Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs,” in Proc. NSDI, 2024
2024
-
[3]
TensorRT-LLM: A High-Performance Inference Library for Large Language Models,
NVIDIA Corporation, “TensorRT-LLM: A High-Performance Inference Library for Large Language Models,” Tech. Rep., 2024
2024
-
[4]
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU,
Y. Sheng et al., “FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU,” in Proc. ICML, pp. 31094–31116, 2023
2023
-
[5]
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale,
R. Y. Aminabadi et al., “DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale,” in Proc. SC, 2022
2022
-
[6]
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve,
A. Agrawal et al., “Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve,” in Proc. OSDI, pp. 117–134, 2024
2024
-
[7]
Attention is All You Need,
A. Vaswani et al., “Attention is All You Need,” in Proc. NeurIPS, vol. 30, 2017
2017
-
[8]
GQA: Training Generalized Multi-Query Attention from Multi-Head Checkpoints,
J. Ainslie et al., “GQA: Training Generalized Multi-Query Attention from Multi-Head Checkpoints,” in Proc. EMNLP, pp. 4895–4901, 2023
2023
-
[9]
Fast Transformer Decoding: One Write-Head is All You Need
N. Shazeer, “Fast Transformer Decoding: One Write-Head is All You Need,” Google Tech. Rep., 2019. arXiv:1911.02150
2019
-
[10]
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-AI, “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model,” Tech. Rep., 2024. arXiv:2405.04434
2024
-
[11]
DeepSeek-V3 Technical Report,
DeepSeek-AI, “DeepSeek-V3 Technical Report,” Tech. Rep., 2024. arXiv:2412.19437
2024
-
[12]
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,
T. Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,” in Proc. NeurIPS, vol. 35, pp. 16344–16359, 2022
2022
-
[13]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,
T. Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,” in Proc. ICLR, 2024
2024
-
[14]
RoFormer: Enhanced Transformer with Rotary Position Embedding,
J. Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding,” Neurocomputing, vol. 568, p. 127063, 2024
2024
-
[15]
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models,
Z. Zhang et al., “H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models,” in Proc. NeurIPS, vol. 36, 2023
2023
-
[16]
SnapKV: LLM Knows What You are Looking for Before Generation
Y. Li et al., “SnapKV: LLM Knows What You are Looking for Before Generation,” arXiv:2404.14469, 2024
2024
-
[17]
Efficient Streaming Language Models with Attention Sinks,
G. Xiao et al., “Efficient Streaming Language Models with Attention Sinks,” in Proc. ICLR, 2024
2024
-
[18]
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time,
Z. Liu et al., “Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time,” in Proc. NeurIPS, vol. 36, 2023
2023
-
[19]
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs,
S. Ge et al., “Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs,” in Proc. ICLR, 2024
2024
-
[20]
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving,
R. Qin et al., “Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving,” Moonshot AI, arXiv:2407.00079, 2024
2024
-
[21]
LMCache: Reducing TTFT for Long-Context LLM Applications via KV Cache Sharing,
X. Lu et al., “LMCache: Reducing TTFT for Long-Context LLM Applications via KV Cache Sharing,” arXiv:2410.10224, 2024
2024
-
[22]
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache,
B. Lin et al., “Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache,” arXiv:2401.02669, 2024
2024
-
[23]
Compute Express Link (CXL) Specification, Revision 3.0,
CXL Consortium, “Compute Express Link (CXL) Specification, Revision 3.0,” Tech. Rep., 2022
2022
-
[24]
BEACON: Scalable Near-Memory-Computing Accelerators with Application to Genome Assembly,
S. Angizi et al., “BEACON: Scalable Near-Memory-Computing Accelerators with Application to Genome Assembly,” in Proc. DAC, 2024
2024
-
[25]
Efficient LLM Inference with CXL-based Heterogeneous Memory,
Y. Sun et al., “Efficient LLM Inference with CXL-based Heterogeneous Memory,” IEEE Micro, vol. 44, no. 3, pp. 48–56, 2024
2024
-
[26]
CXL-SpecKV: Speculative KV Cache Prefetching via CXL-attached Memory,
Z. Wang et al., “CXL-SpecKV: Speculative KV Cache Prefetching via CXL-attached Memory,” arXiv:2406.04517, 2024
2024
-
[27]
NVIDIA H100 Tensor Core GPU Architecture,
NVIDIA Corporation, “NVIDIA H100 Tensor Core GPU Architecture,” Tech. Rep., 2023
2023
-
[28]
GPUDirect Storage: A Direct Path Between Storage and GPU Memory,
NVIDIA Corporation, “GPUDirect Storage: A Direct Path Between Storage and GPU Memory,” Tech. Rep., 2023
2023
-
[29]
GPUDirect Storage for High-Performance Data-Intensive Applications,
S. Li et al., “GPUDirect Storage for High-Performance Data-Intensive Applications,” in Proc. SC, 2023
2023
-
[30]
Using RDMA Efficiently for Key-Value Services,
A. Kalia, M. Kaminsky, and D. G. Andersen, “Using RDMA Efficiently for Key-Value Services,” in Proc. ACM SIGCOMM, pp. 295–306, 2014
2014
-
[31]
Machine Learning: A Probabilistic Perspective,
K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012
2012
-
[32]
On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples,
W. R. Thompson, “On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples,” Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933
1933
-
[33]
RDMA over Commodity Ethernet at Scale,
C. Guo et al., “RDMA over Commodity Ethernet at Scale,” in Proc. ACM SIGCOMM, pp. 202–215, 2016
2016
-
[34]
Splitwise: Efficient Generative LLM Inference Using Phase Splitting,
P. Patel et al., “Splitwise: Efficient Generative LLM Inference Using Phase Splitting,” in Proc. ISCA, pp. 118–132, 2024
2024
-
[35]
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving,
Y. Zhong et al., “DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving,” in Proc. OSDI, pp. 193–210, 2024
2024
-
[36]
A Survey on Efficient Inference for Large Language Models with Focus on KV Cache Management,
H. Xu et al., “A Survey on Efficient Inference for Large Language Models with Focus on KV Cache Management,” arXiv:2412.19442, 2024
2024
-
[37]
NVIDIA Dynamo: Dynamic GPU Inference Serving,
NVIDIA Corporation, “NVIDIA Dynamo: Dynamic GPU Inference Serving,” Tech. Rep., 2025
2025
-
[38]
ShareGPT: Sharing ChatGPT Conversations,
ShareGPT Community, “ShareGPT: Sharing ChatGPT Conversations,”
-
[39]
https://sharegpt.com
-
[40]
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset,
L. Zheng et al., “LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset,” arXiv:2309.11998, 2024
2024
Artifact description appendix (excerpt): Software. The system integrates with vLLM 0.19 and SGLang 0.5.9 through their cache management interfaces. TensorRT-LLM integration uses its native C++ plugin interface. The system is proprietary; implementation details ar...