pith. machine review for the scientific record.

arxiv: 2605.00831 · v1 · submitted 2026-03-26 · 💻 cs.DC · cs.AI · cs.PF

Recognition: no theorem link

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:34 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.PF
keywords fault-tolerant serving · KV cache protection · erasure coding · LLM inference · checkpointing · recovery latency · distributed systems

The pith

GhostServe applies erasure coding to the KV cache, storing parity shards in host memory for fast failure recovery in LLM serving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM serving for long-running tasks such as million-token agents is prone to failures that waste the computation embodied in the large KV cache. The paper introduces GhostServe, which applies erasure coding to the KV cache as it streams during inference, storing the resulting parity data in host memory. Upon a device failure, the lost portion of the cache is reconstructed from the parities, avoiding expensive full recomputation or full replication. The authors report reductions of up to 2.7x in checkpointing latency and 2.1x in recovery latency for a single batch, along with a 1.2x improvement in median response latency under failures. The method aims to make high-availability LLM serving more practical and cost-effective at scale.
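
To make the mechanism concrete, here is a minimal sketch of the shadow-checkpointing idea, assuming a PyTorch-style serving stack. The single XOR parity shard below is a deliberately simplified stand-in for the paper's more general erasure code (reported as an 8:2 configuration in Figure 2), and every name and layout choice is illustrative rather than taken from GhostServe's implementation.

```python
import torch

def checkpoint_chunk_in_shadow(kv_chunk: torch.Tensor, num_shards: int = 8) -> torch.Tensor:
    """Illustrative shadow checkpoint of one KV-cache chunk.

    The chunk's bytes are split into `num_shards` data shards and a single
    XOR parity shard is computed (the simplest erasure code, tolerating the
    loss of any one shard). The parity is then copied asynchronously to
    pinned host memory so decoding can continue while the copy drains.
    """
    flat = kv_chunk.contiguous().view(torch.uint8).flatten()
    shards = flat.reshape(num_shards, -1)            # assumes the byte count divides evenly
    parity = shards[0].clone()
    for shard in shards[1:]:
        parity ^= shard                              # XOR-fold every data shard into the parity
    host_parity = torch.empty(parity.shape, dtype=torch.uint8,
                              device="cpu", pin_memory=True)
    host_parity.copy_(parity, non_blocking=True)     # device-to-host copy overlapped with compute
    return host_parity
```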

Core claim

By applying erasure coding to generate parity shards for the streaming KV cache and storing those shards in host memory, GhostServe allows the inference process to resume after device failures through fast reconstruction of the lost cache state instead of costly recomputation.

What carries the argument

Erasure coding on the KV cache to produce and store parity shards in host memory for shadow protection and reconstruction.
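
A complementary sketch of the recovery path under the same simplified XOR-parity scheme (again an illustration, not the authors' code): when a shard is lost with a failed device, it is rebuilt by fetching the parity back from host memory and XOR-ing it with the surviving shards.

```python
import torch

def reconstruct_lost_shard(surviving_shards: list[torch.Tensor],
                           host_parity: torch.Tensor,
                           device: str = "cuda") -> torch.Tensor:
    """Rebuild the single missing data shard after a device failure.

    Under XOR parity, parity = s_0 ^ s_1 ^ ... ^ s_{k-1}, so the lost shard
    equals the parity XOR-ed with every surviving shard. The parity shard is
    first transferred back from pinned host memory over PCIe.
    """
    lost = host_parity.to(device, non_blocking=True)   # host-to-device fetch of the parity
    for shard in surviving_shards:
        lost ^= shard                                   # peel off each surviving data shard
    return lost                                         # byte-identical to the missing shard
```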

If this is right

  • Checkpointing latency is reduced by up to 2.7x for a single batch compared to existing methods.
  • Recovery latency is reduced by 2.1x for a single batch.
  • Median response latency improves by 1.2x in the presence of system failures.
  • Seamless resumption of inference without full recomputation or state replication.
  • Support for fault-tolerant serving of long-sequence LLM applications at lower cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Longer context windows would benefit more since KV cache growth amplifies the savings.
  • The approach might extend to protecting other transient states in distributed inference.
  • Combining with other fault tolerance techniques could further improve availability.
  • Adoption could lower operational costs for cloud LLM providers by reducing wasted compute on failures.

Load-bearing premise

Erasure coding overhead remains small enough that cache reconstruction from host memory parities is faster than recomputing the sequence from prior state.
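
One way to read this premise as a rough inequality (our framing, not the paper's): reconstruction wins whenever fetching and decoding the parities is cheaper than redoing the prefill,

```latex
T_{\mathrm{reconstruct}} \;\approx\; \frac{S_{\mathrm{lost}}}{B_{\mathrm{PCIe}}} + T_{\mathrm{decode}}
\;<\;
T_{\mathrm{recompute}} \;\approx\; \frac{F_{\mathrm{prefill}}(L)}{P_{\mathrm{GPU}}}
```

where S_lost is the volume of KV-cache bytes to restore, B_PCIe the host-to-device bandwidth, T_decode the erasure-decoding time, F_prefill(L) the floating-point work to recompute the cache for an L-token prefix, and P_GPU the achieved throughput. S_lost grows linearly with L while the attention portion of F_prefill(L) grows quadratically, so if the premise holds at all it should hold more comfortably as sequences lengthen.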

What would settle it

Measure the end-to-end time from failure detection to resumed inference for a long sequence using GhostServe versus a baseline that restarts computation, checking if the reconstruction path is faster.
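
A minimal sketch of such a settling experiment, assuming a serving engine that exposes hooks for injecting a device failure and for recovering by either path; `engine`, `inject_device_failure`, `recover`, and the `mode` strings are hypothetical scaffolding, not GhostServe's API.

```python
import time

def time_failure_to_resume(engine, prompt_tokens, mode: str) -> float:
    """Wall-clock seconds from failure injection to the next generated token.

    mode == "reconstruct": restore the lost KV cache from host-memory parities.
    mode == "recompute":   rebuild the lost KV cache by re-running the prefill.
    """
    engine.prefill(prompt_tokens)             # build the KV cache for a long prefix
    engine.inject_device_failure(rank=0)      # simulate losing one device's cache shard
    start = time.perf_counter()
    engine.recover(strategy=mode)             # reconstruction path vs. recomputation path
    engine.decode_one_step()                  # first token produced after resumption
    return time.perf_counter() - start

# The premise holds if, for the same long sequence,
# time_failure_to_resume(engine, tokens, "reconstruct") < time_failure_to_resume(engine, tokens, "recompute").
```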

Figures

Figures reproduced from arXiv: 2605.00831 by Chinmay Dhanraj Nehate, Jun Wang, Shakya Jayakody, Youpeng Zhao.

Figure 1
Figure 1: (a) Left: Prefill and decode stages in LLM inference. In the prefill stage, all queries are processed with keys and values computed and stored in memory. During the subsequent decode stage, only the attention for new queries is needed, significantly reducing attention computation. (b) Middle: Chunked prefill mechanism. During prefill, the input is divided into chunks of tokens, where the corresponding KV c…
Figure 2
Figure 2: Comparison of checkpointing latency and memory overhead of our erasure coding (8:2) method and state replication (Strati et al., 2024) during prefill. Results are profiled with LLaMA-3-70B using SGLang with a batch size of 16, varying input sequence lengths (32K/64K), a chunk size of 2K, and tensor parallelism (TP=8) across 8×H200 GPUs.
Figure 3
Figure 3: (a) Left: System overview of GhostServe. (b) Right: Illustration of the chunk-level checkpointing and load-balancing strategy on the GPU execution timeline. As KV cache chunks are generated (T1, T2, T3, T4), a different GPU is assigned in round-robin fashion to gather and compute the corresponding parity chunk, thereby distributing the checkpointing overhead. The computation for subsequent chunks th…
Figure 4
Figure 4: Performance comparison of different fault-tolerant methods. The latency results are measured with a batch size of 16, a chunk size of 2K, and an output length of 4K, using varying input lengths that range from 2K to 64K. I/O overhead represents the total I/O latency incurred during the checkpointing process. Recovery latency denotes the time required to restore 50% of the chunks to resume inference. All mode…
Figure 5
Figure 5: Performance comparison of different fault-tolerant methods in online serving. Here, we measure P50/P99 latency and effective-inference-time-ratio (EITR) under both failure-free and failure-induced environments. Faults are injected at random steps with a failure rate of 15%.
Figure 6
Figure 6: Kernel microbenchmark. All experiments are conducted using LLaMA-3-70B with a batch size of 16. (a) Left: Performance breakdown for the erasure coding kernel during checkpointing and recovery for different chunk sizes. (b) Right: Impact of implementation method (PyTorch vs. CUDA), kernel fusion, and CUDA graphs on erasure coding performance for different chunk sizes.
Figure 7
Figure 7: Cost-benefit analysis for serving the LLaMA-3-70B model over the entire serving traces. Here, we compare the EITR and MTTR for different methods under varying failure rates (5%–15%).
Figure 8
Figure 8: Sensitivity studies. All experiments are conducted using LLaMA-3-70B with a chunk size of 2K. Recovery latency is the time required to restore 50% of the KV cache. (a) Left: Performance comparison of GhostServe under different parity ratios. (b) Top right: Performance comparison of different fault-recovery methods with varying batch sizes and TP sizes. (c) Bottom right: Ablation studies on the impact of re…
Figure 9
Figure 9: Performance comparison of different methods when scaling to million tokens. Results are reported using LLaMA-3-70B with a batch size of 1, a chunk size of 2K, and an output length of 4K.
read the original abstract

The rise of million-token, agent-based applications has placed unprecedented demands on large language model (LLM) inference services. The long-running nature of these tasks increases their susceptibility to hardware and software faults, leading to costly job failures, wasted resources, and degraded user experience. The stateful key-value (KV) cache, which grows with the sequence length, presents a central challenge as it is a critical and vulnerable component in distributed serving systems. In this work, we propose GhostServe, a novel checkpointing solution to facilitate fault-tolerant LLM serving. Specifically, GhostServe protects the streaming KV cache in the shadow by applying erasure coding to generate and store the parity shards in host memory. In the event of device failures, GhostServe enables fast reconstruction of the lost KV cache, allowing the inference process to resume seamlessly without costly full recomputation or state replication. Evaluations demonstrate that GhostServe reduces checkpointing latency by up to 2.7x and recovery latency by 2.1x for a single batch, and 1.2x median response latency compared to existing methods, in the presence of system failures, paving the way for high-availability and cost-effective LLM serving at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes GhostServe, a checkpointing system for fault-tolerant LLM serving that applies erasure coding to protect the streaming KV cache by storing parity shards in host memory. Upon device failure, it reconstructs the lost cache state to resume inference without full recomputation or replication. Evaluations claim up to 2.7x lower checkpointing latency, 2.1x lower recovery latency for a single batch, and 1.2x lower median response latency versus existing methods under failures.

Significance. If the reconstruction remains faster than recomputation at scale, the work addresses a key reliability gap for long-running, million-token LLM inference services. The lightweight host-memory parity approach could reduce overhead compared to full replication while enabling seamless recovery, supporting higher availability in distributed serving systems.

major comments (2)
  1. §5.3 (Recovery Latency Evaluation): The 2.1x recovery improvement is shown only against other checkpointing baselines; no head-to-head timing versus full forward-pass recomputation of the KV cache is reported for sequences of 1M+ tokens, leaving the central claim that erasure-coded reconstruction avoids costly recomputation untested.
  2. §4.2 (Erasure Coding Design): The mechanism for maintaining parity shards incrementally as the KV cache appends tokens is not fully specified (e.g., per-token vs. batched updates); without quantified PCIe/host bandwidth or CPU overhead scaling with sequence length, it is unclear whether the approach remains low-overhead during normal operation for growing caches.
minor comments (2)
  1. Abstract and §5: The reported speedups (2.7x, 2.1x, 1.2x) should include the exact model sizes, batch configurations, and number of runs with variance to allow verification of the multipliers.
  2. Figure 4 (or equivalent latency plots): Add error bars or confidence intervals to all bar graphs showing checkpointing and recovery times.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper to strengthen the presentation and evaluation where needed.

read point-by-point responses
  1. Referee: §5.3 (Recovery Latency Evaluation): The 2.1x recovery improvement is shown only against other checkpointing baselines; no head-to-head timing versus full forward-pass recomputation of the KV cache is reported for sequences of 1M+ tokens, leaving the central claim that erasure-coded reconstruction avoids costly recomputation untested.

    Authors: We agree that a direct comparison to full recomputation would provide stronger support for the central claim. In the revised manuscript we will add experiments that measure the wall-clock time required to recompute the KV cache from scratch via forward passes for sequences of 1M+ tokens and directly contrast these times with GhostServe reconstruction latency under the same hardware configuration. This will quantify the savings relative to recomputation rather than only to other checkpointing schemes. revision: yes

  2. Referee: §4.2 (Erasure Coding Design): The mechanism for maintaining parity shards incrementally as the KV cache appends tokens is not fully specified (e.g., per-token vs. batched updates); without quantified PCIe/host bandwidth or CPU overhead scaling with sequence length, it is unclear whether the approach remains low-overhead during normal operation for growing caches.

    Authors: We will expand Section 4.2 to fully specify the incremental parity-update procedure, clarifying whether parity shards are refreshed on a per-token basis or in batches and describing the exact data movement between GPU and host memory. In addition, we will include new measurements of CPU utilization and PCIe bandwidth consumption as functions of sequence length during steady-state serving, demonstrating that the overhead remains negligible even as the KV cache grows to millions of tokens. revision: yes
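
For a rough sense of why chunk-granularity parity updates could keep steady-state overhead bounded (a back-of-envelope sketch, not a result from the paper): the extra host-bound traffic added when a chunk is checkpointed scales with the parity ratio times that chunk's KV bytes, independent of how long the cache already is. The formula assumes a dense KV cache of 2 x layers x per-token KV width x tokens elements and reads 8:2 as eight data shards plus two parity shards; all parameter values in the example are placeholders.

```python
def shadow_traffic_per_chunk_gb(num_layers: int, kv_hidden: int, chunk_tokens: int,
                                bytes_per_elem: int = 2, parity_ratio: float = 2 / 8) -> float:
    """Extra device-to-host traffic added per checkpointed KV-cache chunk, in GB.

    kv_hidden is the per-token, per-layer KV width (num_kv_heads * head_dim);
    the factor of 2 accounts for keys and values. With an 8:2 code the parity
    volume is 2/8 = 25% of the protected bytes, and the per-chunk cost does
    not grow with the total sequence length accumulated so far.
    """
    kv_bytes = 2 * num_layers * kv_hidden * chunk_tokens * bytes_per_elem
    return parity_ratio * kv_bytes / 1e9

# Example with placeholder dimensions (80 layers, kv_hidden=1024, 2K-token chunks, fp16):
# shadow_traffic_per_chunk_gb(80, 1024, 2048) ≈ 0.17 GB of parity per chunk.
```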

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with no self-referential derivation

full rationale

The paper presents GhostServe as an engineering system that applies standard erasure coding to KV-cache shards stored in host memory, then measures checkpointing and recovery latencies on real hardware. No equations derive a result from itself; no parameters are fitted to a subset and then relabeled as predictions; no uniqueness theorem or ansatz is imported via self-citation to force the design. All performance numbers (2.7× checkpointing, 2.1× recovery) are direct wall-clock measurements against baselines, not algebraic identities. The central premise that reconstruction can be faster than recomputation is an empirical claim left open to falsification by the reported timings rather than a definitional tautology. Consequently the derivation chain is self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The design rests on standard assumptions about failure models and erasure coding properties from distributed systems literature; no new free parameters, axioms, or invented entities are introduced beyond the system name itself.

pith-pipeline@v0.9.0 · 5527 in / 1009 out tokens · 25976 ms · 2026-05-15T00:34:32.118537+00:00 · methodology


Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 7 internal anchors

  1. [1] Agrawal, A., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B. S., and Ramjee, R. Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369.

  2. [2] Agrawal, A., Qiu, H., Chen, J., Goiri, Í., Zhang, C., Shahid, R., Ramjee, R., Tumanov, A., and Choukse, E. No request left behind: Tackling heterogeneity in long-context LLM inference with Medha. arXiv preprint arXiv:2409.17264.

  3. [3] Aguilera, M. K., Janakiraman, R., and Xu, L. Using erasure codes efficiently for storage in a distributed system. In 2005 International Conference on Dependable Systems and Networks (DSN'05), pp. 336–345. IEEE.

  4. [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

  5. [5] Coppock, P. H., Zhang, B., Solomon, E. H., Kypriotis, V., Yang, L., Sharma, B., Schatzberg, D., Mowry, T., and Skarlatos, D. Lithos: An operating system for efficient machine learning on GPUs. arXiv preprint arXiv:2504.15465.

  6. [6] Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.

  7. [7] Ganguly, D., Melhem, R. G., and Yang, J. An adaptive framework for oversubscription management in CPU-GPU unified memory. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1212–1217.

  8. [8] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

  9. [9] He, S., Cai, W., Huang, J., and Li, A. Capacity-aware inference: Mitigating the straggler effect in mixture of experts. arXiv preprint arXiv:2503.05066.

  10. [10] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models. arXiv preprint.

  11. [11] Jiang, Y., Fu, F., Yao, X., He, G., Miao, X., Klimovic, A., Cui, B., Yuan, B., and Yoneki, E. Demystifying cost-efficiency in LLM serving over heterogeneous GPUs. arXiv preprint arXiv:2502.00722.

  12. [12] Kaplan, J., McCandlish, S., Henighan, T. J., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

  13. [13] Kokolis, A., Kuchnik, M., Hoffman, J., Kumar, A., Malani, P., Ma, F., DeVito, Z., Sengupta, S., Saladi, K., and Wu, C.-J. Revisiting reliability in large-scale machine learning research clusters. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1259–1274.

  14. [14] Lin, J., Jiang, Z., Song, Z., Zhao, S., Yu, M., Wang, Z., Wang, C., Shi, Z., Shi, X., Jia, W., Liu, Z., Wang, S., Lin, H., Liu, X., Panda, A., and Li, J. Understanding stragglers in large model training using what-if analysis. arXiv preprint arXiv:2505.05713.

  15. [15] Ranganathan, B., Zhang, M., and Wu, K. Enhancing reliability in AI inference services: An empirical study on real production incidents. arXiv preprint arXiv:2511.07424.

  16. [16] Salpekar, O., Varma, R., Yu, K., Ivanov, V., Wang, Y., Sharif, A., Si, M., Xu, S., Tian, F., Zheng, S., Rice, T., Garg, A., Peng, S., Siravara, S., Fu, W., de Castro, R., Gangidi, A., Obraztsov, A. S., Narang, S., Edunov, S., Naumov, M., Tang, C., and Oldham, M. Training LLMs with fault tolerant HSDP on 100,000 GPUs. arXiv preprint arXiv:2602.00277.

  17. [17] Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.

  18. [18] Wikipedia. 2024 CrowdStrike-related IT outages. https://en.wikipedia.org/wiki/2024_CrowdStrike-related_IT_outages.

  19. [19] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

  20. [20] Wu, B., Zhong, Y., Zhang, Z., Huang, G., Liu, X., and Jin, X. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920.

  21. [21] Yang, A., Yang, J., Ibrahim, A., Xie, X., Tang, B., Sizov, G. G., Reizenstein, J., Park, J., and Huang, J. Context parallelism for scalable million-token inference. arXiv preprint arXiv:2411.01783.

  22. [22] Ye, Z., Chen, L., Lai, R., Lin, W., Zhang, Y., Wang, S., Chen, T., Kasikci, B., Grover, V., Krishnamurthy, A., and Ceze, L. FlashInfer: Efficient and customizable attention engine for LLM inference serving. arXiv preprint arXiv:2501.01005, 2025.

  23. [23] Zhao, Y., Wu, D., and Wang, J. ALISA: Accelerating large language model inference via sparsity-aware KV caching. arXiv preprint arXiv:2403.17312.

  24. [24] Zhu, Q., Duan, J., Chen, C., Liu, S., Li, X., Feng, G., Lv, X., Cao, H., Xiao, C., Zhang, X., Lin, D., and Yang, C. SampleAttention: Near-lossless acceleration of long context LLM inference with adaptive structured sparse attention. arXiv preprint arXiv:2406.15486, 2024.