
arxiv: 2606.20537 · v1 · pith:5Z7OJXXAnew · submitted 2026-06-18 · 💻 cs.LG · cs.DC

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

Pith reviewed 2026-06-26 17:37 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords execution-state capsules, checkpoint restore, LLM serving, on-device inference, KV cache, graph capture, recurrent state, physical-AI

The pith

Execution-state capsules checkpoint and restore the full LLM execution state at graph-bound boundaries for sub-millisecond reuse in low-latency on-device serving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mainstream serving reuses only KV-cache fragments for high-throughput workloads. This paper targets the complementary regime of small-batch, low-latency, on-device physical-AI agents that must branch, reset, and re-enter under tight budgets. It defines execution-state capsules as a complete checkpoint-restore primitive that captures every named buffer at committed execution boundaries. The mechanism runs on captured graph plans over contiguous static buffers, so it includes recurrent state, convolution state, and metadata in addition to KV. Restores remain byte-exact and token-identical under greedy decode while delivering TTFT speedups that scale with context length.

Core claim

FlashRT's execution-state capsules snapshot, restore, fork, or roll back the entire execution boundary, including KV, recurrent state, convolution state, MTP state, and metadata. On an RTX 5090, restore is byte-exact at the stored-state level and token-identical under greedy decode; GPU-resident snapshot and restore complete in under a millisecond. TTFT speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. A KV-only ablation diverges, showing recurrent state is load-bearing. The same correctness and structural properties hold on Jetson AGX Thor and DGX Spark.
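The four verbs in the core claim can be pictured with a minimal host-side sketch. This is not FlashRT's API: `Capsule`, `ExecutionState`, and the buffer contents are hypothetical names chosen for illustration, and plain Python byte buffers stand in for GPU-resident static buffers.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Capsule:
    """Immutable copy of every named buffer at one committed boundary."""
    buffers: dict  # name -> bytes


@dataclass
class ExecutionState:
    """Hypothetical stand-in for a runtime's closed set of named buffers."""
    buffers: dict = field(default_factory=dict)  # name -> bytearray

    def snapshot(self) -> Capsule:
        # Copy out every buffer; nothing is shared with the live state.
        return Capsule({name: bytes(buf) for name, buf in self.buffers.items()})

    def restore(self, capsule: Capsule) -> None:
        # Refuse capsules that do not cover exactly the live buffer set.
        if capsule.buffers.keys() != self.buffers.keys():
            raise ValueError("capsule does not match the live buffer set")
        for name, data in capsule.buffers.items():
            if len(data) != len(self.buffers[name]):
                raise ValueError(f"size mismatch for buffer {name!r}")
            self.buffers[name][:] = data  # overwrite in place, same size

    def fork(self) -> "ExecutionState":
        # Independent copy that can diverge from the parent.
        return ExecutionState({n: bytearray(b) for n, b in self.buffers.items()})


# Illustrative buffer names and contents (assumptions, not the paper's layout).
state = ExecutionState({
    "kv": bytearray(b"\x01\x02\x03\x04"),
    "recurrent": bytearray(b"\x10\x20"),
    "conv": bytearray(b"\x30"),
    "mtp": bytearray(b"\x40"),
    "metadata": bytearray(b"\x04"),
})
boundary = state.snapshot()            # checkpoint at a committed boundary
branch = state.fork()                  # explore without touching the parent
branch.buffers["recurrent"][0] = 0xFF
state.buffers["kv"][0] = 0xEE          # speculative work on the parent
state.restore(boundary)                # roll back to the boundary
```

In this sketch a rollback is simply a restore of an earlier capsule, and a fork is a snapshot that becomes a second live state. Whether the real runtime shares that structure is not stated in the abstract.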

What carries the argument

Execution-state capsules: a graph-bound checkpoint and restore mechanism for the complete restorable state at a committed boundary, implemented by running captured graph plans over contiguous static buffers with no block-table indirection.
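One way to see why contiguous static buffers matter: if every named buffer lives at a fixed offset inside one allocation, a snapshot is a single flat copy and no per-block lookup is needed. The sketch below is a host-memory analogy under that assumption; `StaticArena` and its layout are hypothetical and do not describe FlashRT's actual memory plan.

```python
class StaticArena:
    """One contiguous allocation with a fixed name -> (offset, size) layout."""

    def __init__(self, sizes: dict):
        self.layout = {}
        offset = 0
        for name, size in sizes.items():
            self.layout[name] = (offset, size)  # fixed for the arena's lifetime
            offset += size
        self.memory = bytearray(offset)

    def view(self, name: str) -> memoryview:
        # Direct addressing: a named buffer is just a slice of the arena,
        # with no block table to consult.
        offset, size = self.layout[name]
        return memoryview(self.memory)[offset:offset + size]

    def snapshot(self) -> bytes:
        # The whole restorable state is captured by one flat copy.
        return bytes(self.memory)

    def restore(self, image: bytes) -> None:
        if len(image) != len(self.memory):
            raise ValueError("image size does not match the arena")
        self.memory[:] = image  # same-size overwrite, layout unchanged


# Sizes are illustrative assumptions.
arena = StaticArena({"kv": 8, "recurrent": 4, "conv": 2, "metadata": 2})
arena.view("kv")[0] = 7
image = arena.snapshot()
arena.view("recurrent")[1] = 9   # state drifts after the boundary
arena.restore(image)             # back to the committed boundary
```

A paged design would instead resolve each logical block through a table before copying, which is the indirection the paper says its evaluated backend avoids.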

If this is right

  • Capsule restore is byte-exact at the stored-state level and token-identical under greedy decode.
  • GPU-resident snapshot and restore complete in under a millisecond.
  • TTFT speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens.
  • A KV-only ablation diverges, confirming that recurrent state must be included.
  • The same correctness and structural properties hold on Jetson AGX Thor and DGX Spark.
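The KV-only divergence in the list above can be reproduced in miniature. The toy model below is an assumption made purely for illustration (it is not the paper's model or ablation setup): each step appends to a positional KV history and also folds the token into a fixed-size recurrent summary, and the greedy next token depends on both.

```python
import copy

VOCAB = 97  # toy vocabulary size


def step(state: dict, token: int) -> int:
    """Consume one token, update both state kinds, return the greedy next token."""
    state["kv"].append(token)  # positional state, grows per token
    state["recurrent"] = (state["recurrent"] * 31 + token + 1) % 1_000_003
    return (sum(state["kv"]) + state["recurrent"]) % VOCAB


def prefill(prompt: list) -> tuple:
    """Run the prompt from an empty state; return the state and first token."""
    state = {"kv": [], "recurrent": 0}
    next_token = 0
    for token in prompt:
        next_token = step(state, token)
    return state, next_token


def decode(state: dict, first_token: int, count: int) -> list:
    """Deterministic greedy decode of `count` tokens."""
    tokens, token = [], first_token
    for _ in range(count):
        token = step(state, token)
        tokens.append(token)
    return tokens


live, first = prefill([5, 17, 42])
capsule = copy.deepcopy(live)                      # full state at the boundary

reference = decode(live, first, 8)                 # uninterrupted run
full = decode(copy.deepcopy(capsule), first, 8)    # restore every buffer
kv_only_state = {"kv": list(capsule["kv"]), "recurrent": 0}  # recurrent state dropped
kv_only = decode(kv_only_state, first, 8)

print("full restore matches:", full == reference)
print("KV-only restore matches:", kv_only == reference)
```

In this toy, the full restore reproduces the reference tokens while the KV-only restore departs from them at the first decoded token, because the recurrent summary is not recoverable from the KV history alone without recomputation.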

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Capsules could support repeated branching inside robotic control loops without recomputing prior execution prefixes.
  • The closed-buffer assumption may restrict direct use with models that rely on dynamic memory allocation during inference.
  • Hybrid schedulers could route latency-critical interactive paths through capsules while routing bulk throughput through conventional KV caches.
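The first extension above, branching inside a control loop, can be sketched as follows. `ToyPolicy`, its update rule, and the observation values are hypothetical; the point is only that the shared prefix runs once and each candidate branch starts from a restored boundary.

```python
import copy


class ToyPolicy:
    """Hypothetical stateful policy with positional and recurrent state."""

    def __init__(self):
        self.state = {"history": [], "recurrent": 0}
        self.steps_executed = 0  # counts model steps actually run

    def step(self, observation: int) -> int:
        self.state["history"].append(observation)
        self.state["recurrent"] = (self.state["recurrent"] * 31 + observation) % 65521
        self.steps_executed += 1
        return self.state["recurrent"] % 7  # toy action score

    def snapshot(self) -> dict:
        return copy.deepcopy(self.state)

    def restore(self, capsule: dict) -> None:
        self.state = copy.deepcopy(capsule)


def score_branches(policy: ToyPolicy, prefix: list, candidates: list) -> dict:
    """Run the prefix once, then score each candidate from the same boundary."""
    for observation in prefix:
        policy.step(observation)
    boundary = policy.snapshot()
    scores = {}
    for candidate in candidates:
        policy.restore(boundary)      # re-enter at the committed boundary
        scores[candidate] = policy.step(candidate)
    policy.restore(boundary)          # leave the policy at the boundary
    return scores


policy = ToyPolicy()
scores = score_branches(policy, [3, 1, 4, 1, 5], [10, 20, 30])
print(scores, "steps executed:", policy.steps_executed)
```

With five prefix steps and three candidates, the sketch executes eight model steps in total, where re-running the prefix for each branch would execute eighteen.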

Load-bearing premise

The live execution state forms a closed set of named buffers that can be managed via captured graph plans over contiguous static buffers with no block-table indirection.
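The closed-set premise is checkable in principle: given a manifest of every named buffer, a capsule either covers it exactly or it does not. The sketch below assumes a hypothetical manifest format (`BufferSpec`, the buffer names, and the sizes are all invented for illustration); the paper's abstract does not enumerate its buffers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferSpec:
    name: str
    kind: str    # e.g. "kv", "recurrent", "conv", "mtp", "metadata"
    nbytes: int


def audit_capsule(manifest: list, capsule: dict) -> list:
    """Return a list of problems; an empty list means the capsule is complete."""
    problems = []
    expected = {spec.name: spec for spec in manifest}
    for name, spec in expected.items():
        if name not in capsule:
            problems.append(f"missing {spec.kind} buffer {name!r}")
        elif len(capsule[name]) != spec.nbytes:
            problems.append(
                f"{name!r}: expected {spec.nbytes} bytes, got {len(capsule[name])}"
            )
    for name in capsule:
        if name not in expected:
            problems.append(f"unexpected buffer {name!r} not in manifest")
    return problems


# Hypothetical manifest; names and sizes are placeholders.
manifest = [
    BufferSpec("layer0.kv", "kv", 16),
    BufferSpec("layer0.recurrent", "recurrent", 8),
    BufferSpec("layer0.conv", "conv", 4),
    BufferSpec("mtp.state", "mtp", 4),
    BufferSpec("metadata", "metadata", 2),
]
complete = {spec.name: bytes(spec.nbytes) for spec in manifest}
kv_only = {"layer0.kv": bytes(16)}

print(audit_capsule(manifest, complete))
print(audit_capsule(manifest, kv_only))
```

An audit of this kind only works if the manifest itself is complete, which is exactly the assumption the premise rests on.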

What would settle it

A restore operation that produces a different token sequence than re-execution from the same starting point under identical greedy decoding, or a measured GPU-resident restore latency exceeding the reported sub-millisecond range on the tested hardware.
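The falsification criteria above translate into three checks: stored-state byte equality, token-sequence identity, and a latency budget. The helper names below are hypothetical, the 1.0 ms default budget is an assumption standing in for "sub-millisecond", and the inputs in the usage lines are placeholders rather than measurements.

```python
import hashlib


def state_digest(buffers: dict) -> str:
    """SHA-256 over names, sizes, and raw bytes in a fixed (sorted) order."""
    digest = hashlib.sha256()
    for name in sorted(buffers):
        data = bytes(buffers[name])
        digest.update(name.encode("utf-8"))
        digest.update(len(data).to_bytes(8, "little"))
        digest.update(data)
    return digest.hexdigest()


def first_divergence(reference: list, candidate: list):
    """Index of the first differing token, or None if the sequences match."""
    for index, (expected, actual) in enumerate(zip(reference, candidate)):
        if expected != actual:
            return index
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


def evaluate_restore(saved, restored, reference_tokens, restored_tokens,
                     restore_ms, budget_ms=1.0):
    """Collect the three pass/fail signals for one restore."""
    return {
        "byte_exact": state_digest(saved) == state_digest(restored),
        "first_divergence": first_divergence(reference_tokens, restored_tokens),
        "within_budget": restore_ms < budget_ms,
    }


# Placeholder inputs for illustration only; none of these are measured values.
saved = {"kv": b"\x01\x02", "recurrent": b"\x03"}
restored = {"recurrent": bytearray(b"\x03"), "kv": bytearray(b"\x01\x02")}
report = evaluate_restore(saved, restored, [4, 8, 15], [4, 8, 15], restore_ms=0.4)
print(report)
```

A non-None `first_divergence` or a false `within_budget` on the tested hardware would be the kind of evidence the paragraph above describes.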

Figures

Figures reproduced from arXiv: 2606.20537 by Liang Su.

Figure 1. What each system manages, drawn concretely.
Figure 2. The serving verbs as operations on the buffer set (cf. Algorithm 1).
Figure 3. Physical-AI serving scenarios and the capsule (cf. the verbs in Fig. 2).
Figure 4. Runtime floor and state-reuse floor, identical model/GPU, single-stream. FlashRT …
Figure 5. Embodied loop, single-stream, under comparable …
Figure 6. LLM+TTS barge-in, single-stream, composed latency: measured LLM persona re-entry + a separately measured fixed TTS first-audio term (94 ms), not a co-resident end-to-end measurement. After an interrupt, capsule (restore the pinned persona) cuts the LLM re-entry vs naive (re-prefill); the TTS term is identical for both, so the capsule is the 2.02× difference. low-overhead execution and tight chunking, not a…
Figure 7. Higgs TTS, RTX 5090, single-stream (concurrency …
Figure 8. Cross-device, single-stream, paper-TTFT convention (first base-logit token, MTP tail …
Figure 9. Embodied working set across the three devices, single-stream: revisit TTFT as …
Original abstract

Mainstream LLM serving systems reuse prefix work mainly through paged or radix key-value (KV) caches. This is highly effective for high-throughput, high-concurrency serving, but it manages only one positional fragment of execution state: the KV cache. We study the opposite regime: low-latency, small-batch, on-device physical-AI serving, where interactive LLM agents, speech systems, and robot policies repeatedly branch, reset, interrupt, and re-enter under tight responsiveness budgets. We introduce execution-state capsules, a graph-bound checkpoint and restore mechanism for the complete restorable state at a committed boundary. FlashRT is a white-box, backend-facing kernel runtime whose evaluated NVIDIA CUDA backend runs captured graph plans over contiguous static buffers with no block-table indirection. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the whole execution boundary, including KV, recurrent state, convolution state, MTP state, and metadata. This moves reuse from token-addressed KV fragments to graph-bound execution-state boundaries. On an RTX 5090, capsule restore is byte-exact at the stored-state level and token-identical under greedy decode. A KV-only ablation diverges, showing that recurrent state is load-bearing. GPU-resident snapshot and restore are sub-millisecond, and TTFT speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. On Jetson AGX Thor and DGX Spark, the same correctness and structural properties hold. Capsules are not a replacement for high-throughput KV-cache serving; they define a complementary latency-first serving point for explicit execution-state reuse.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces execution-state capsules, a graph-bound checkpoint/restore mechanism for the complete restorable execution state (KV, recurrent, convolution, MTP, metadata) at committed boundaries in low-latency small-batch on-device LLM serving. FlashRT is presented as a white-box runtime whose NVIDIA CUDA backend uses captured graph plans over contiguous static buffers with no block-table indirection, enabling byte-exact snapshot/restore, token-identical greedy outputs, sub-millisecond GPU-resident operations, and TTFT speedups of 3.9x (2k tokens) to 27x (16k tokens) over cold prefill. A KV-only ablation is reported to diverge, and the same properties are claimed to hold on Jetson AGX Thor and DGX Spark hardware.

Significance. If the central mechanism and empirical claims hold, the work defines a latency-first complementary point to paged KV-cache serving for interactive physical-AI workloads that require frequent branching, reset, and re-entry. The multi-hardware evaluation and the KV-only ablation that isolates recurrent state as load-bearing are concrete strengths; the absence of fitted parameters or self-referential derivations is appropriate for an empirical runtime contribution.

major comments (2)
  1. [Abstract] The central claim that 'the live state is a closed set of named buffers' enabling complete byte-exact snapshot/restore rests on the unverified assertion that all components (KV, recurrent, convolution, MTP, metadata) are captured by static graph plans with no dynamic allocation or indirection; without an explicit enumeration of these buffers or the capture procedure, the completeness guarantee and the interpretation of the KV-only ablation cannot be assessed.
  2. [Abstract] Concrete performance numbers (sub-millisecond restore, 3.9x–27x TTFT speedup) and correctness properties (byte-exact at stored-state level, token-identical under greedy decode) are reported without methods details, number of trials, error bars, or raw data; these details are load-bearing for the empirical support of the speedup claims across hardware platforms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough review and positive assessment of the work's significance for latency-first on-device serving. We respond to each major comment below, committing to revisions that add the requested details for verifiability.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'the live state is a closed set of named buffers' enabling complete byte-exact snapshot/restore rests on the unverified assertion that all components (KV, recurrent, convolution, MTP, metadata) are captured by static graph plans with no dynamic allocation or indirection; without an explicit enumeration of these buffers or the capture procedure, the completeness guarantee and the interpretation of the KV-only ablation cannot be assessed.

    Authors: We agree that the abstract's brevity leaves the buffer set and capture procedure implicit. In revision we will add a dedicated Methods subsection 'Buffer Enumeration and Graph Capture Procedure' that explicitly enumerates every restorable named buffer (per-layer KV tensors, recurrent/SSM hidden states, convolution buffers, MTP states, and metadata) together with the exact sequence of CUDA graph captures that enforce contiguous static allocation and eliminate dynamic allocation or block-table indirection. This addition will directly substantiate the closed-set claim and clarify why the KV-only ablation diverges.
    Revision: yes

  2. Referee: [Abstract] Concrete performance numbers (sub-millisecond restore, 3.9x–27x TTFT speedup) and correctness properties (byte-exact at stored-state level, token-identical under greedy decode) are reported without methods details, number of trials, error bars, or raw data; these details are load-bearing for the empirical support of the speedup claims across hardware platforms.

    Authors: We acknowledge that the reported numbers and correctness properties require fuller methodological support. The revised manuscript will expand the Experiments section with: (i) the precise measurement protocol (CUDA event timing with full synchronization barriers), (ii) the number of independent trials per configuration (minimum 50) together with standard-deviation error bars, (iii) per-platform configuration details for RTX 5090, Jetson AGX Thor, and DGX Spark, and (iv) a statement that raw timing and output logs will be released in a public repository. These changes will strengthen the empirical grounding of the speedup and correctness claims.
    Revision: yes
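The protocol promised in the second response (repeated trials with standard-deviation error bars) has a simple shape. The sketch below is a host-side analogy only: it uses Python's `time.perf_counter_ns` rather than CUDA events, `time_trials` is a hypothetical helper, and the timed operation is a byte-buffer copy standing in for a capsule restore, so its numbers say nothing about GPU latency.

```python
import statistics
import time


def time_trials(operation, trials: int = 50, warmup: int = 5) -> dict:
    """Time `operation` repeatedly and summarize the samples in milliseconds."""
    if trials < 2:
        raise ValueError("need at least two trials for a standard deviation")
    for _ in range(warmup):
        operation()  # warm caches and allocators before measuring
    samples_ms = []
    for _ in range(trials):
        start = time.perf_counter_ns()
        operation()
        samples_ms.append((time.perf_counter_ns() - start) / 1e6)
    return {
        "trials": trials,
        "mean_ms": statistics.mean(samples_ms),
        "stdev_ms": statistics.stdev(samples_ms),  # sample standard deviation
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


# Stand-in workload: overwrite a 1 MiB live buffer from a saved image.
saved_image = bytes(1 << 20)
live_buffer = bytearray(1 << 20)


def restore_stand_in() -> None:
    live_buffer[:] = saved_image


summary = time_trials(restore_stand_in, trials=50)
print(summary)
```

On a GPU, the same loop would need device-side timers and explicit synchronization around each trial, as the authors' response indicates; a host clock alone would mostly measure launch overhead.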

Circularity Check

0 steps flagged

No circularity; empirical system evaluation with no derivations or self-referential reductions

Full rationale

The paper introduces a runtime mechanism (execution-state capsules via FlashRT) for checkpoint/restore of LLM execution state and evaluates it empirically on NVIDIA hardware with reported TTFT speedups, byte-exact restores, and KV-only ablations. No equations, fitted parameters, predictions derived from inputs, or self-citations appear in the provided text. The central premise (state as closed set of named static buffers with captured graphs) is a design assumption enabling the mechanism, not a reduction to prior results or self-definition. Claims rest on direct measurement rather than any load-bearing derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The contribution is a new system mechanism with no fitted parameters or invented physical entities; it rests on domain assumptions about GPU graph capture and static buffers.

axioms (1)
  • domain assumption: GPU execution state at committed boundaries can be represented as a closed set of named buffers managed via captured static graph plans.
    Invoked in the description of FlashRT and capsule operation.
invented entities (1)
  • execution-state capsule (no independent evidence)
    Purpose: checkpoint and restore mechanism for full graph-bound execution state.
    New construct introduced to enable the described reuse; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.1-grok · 5849 in / 1366 out tokens · 41944 ms · 2026-06-26T17:37:44.179284+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

    cs.RO · 2026-07 · unverdicted · novelty 7.0

    Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM b...
