pith. machine review for the scientific record.

arxiv: 2604.28138 · v1 · submitted 2026-04-30 · 💻 cs.OS · cs.AI

Recognition: unknown

Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes

Chaokun Chang, Lunxi Cao, Tianyuan Wu, Wei Gao, Wei Wang

Pith reviewed 2026-05-07 05:53 UTC · model grok-4.3

classification 💻 cs.OS cs.AI
keywords checkpoint restore · agent sandboxes · eBPF · semantics-aware · fault tolerance · container runtime · LLM agents

The pith

Crab uses an eBPF inspector to classify OS effects per agent turn and checkpoint only the recovery-relevant ones, reaching 100% correctness while cutting traffic by up to 87%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that agent sandboxes suffer from a semantic gap: agent code sees tool calls but not their full OS consequences, while the OS sees state changes but cannot tell which ones matter for later recovery. Crab fills this gap with a transparent host-side runtime that inspects each turn's effects and decides checkpoint granularity accordingly. It aligns those checkpoints with turn boundaries, overlaps the work with LLM wait time, and schedules traffic across co-located sandboxes. The approach rests on the observation that more than 75% of turns produce no recovery-relevant state. If the method works, agent systems can gain reliable fault tolerance, spot execution, and rollback without paying the cost of full per-turn checkpointing.
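
The saving the paper claims follows directly from this sparsity: if only a minority of turns mutate recovery-relevant state, a selective policy checkpoints only those turns. A minimal sketch of the arithmetic (the turn trace and its labels are illustrative, not taken from the paper):

```python
# Toy model of selective vs. full per-turn checkpointing.
# Each turn is labeled with whether it mutated recovery-relevant state;
# this trace is illustrative, loosely matching the >75% sparsity claim.
turns = [
    {"id": 0, "relevant": False},  # e.g. a read-only `ls`
    {"id": 1, "relevant": True},   # e.g. wrote a source file
    {"id": 2, "relevant": False},
    {"id": 3, "relevant": False},
    {"id": 4, "relevant": False},
    {"id": 5, "relevant": True},
    {"id": 6, "relevant": False},
    {"id": 7, "relevant": False},
]

full_ckpts = len(turns)                              # naive: checkpoint every turn
selective_ckpts = sum(t["relevant"] for t in turns)  # selective: only relevant turns
saving = 1 - selective_ckpts / full_ckpts

print(f"checkpoints: {selective_ckpts}/{full_ckpts}, traffic saved: {saving:.0%}")
```

With 2 relevant turns out of 8, the selective policy issues a quarter of the checkpoints, which is the shape of the up-to-87% traffic reduction the paper reports at its measured sparsity.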

Core claim

Crab is a transparent host-side runtime that bridges the agent-OS semantic gap without changing agents or C/R backends. An eBPF-based inspector classifies every turn's OS-visible effects as recovery-relevant or irrelevant. A coordinator aligns checkpoints with turn boundaries and overlaps C/R work with LLM wait time. A host-scoped engine then schedules the resulting traffic across co-located sandboxes. On shell-intensive and code-repair workloads this yields 100% recovery correctness, up to 87% less checkpoint traffic, and execution time within 1.9% of the fault-free baseline.
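
Overlapping C/R work with LLM wait time means the checkpoint dump runs concurrently with the (typically much longer) model round trip, so the exposed delay is only the excess, if any. A sketch of that overlap with asyncio (the timings and function names are hypothetical, not the paper's):

```python
import asyncio
import time

async def llm_round_trip():
    await asyncio.sleep(0.20)   # hypothetical LLM thinking time
    return "next tool call"

async def checkpoint_dump():
    await asyncio.sleep(0.05)   # hypothetical dump cost, shorter than the LLM wait
    return "ckpt-v1"

async def turn_boundary():
    # Dispatch the checkpoint right after forwarding the LLM request,
    # so the dump runs concurrently with model inference.
    start = time.monotonic()
    reply, ckpt = await asyncio.gather(llm_round_trip(), checkpoint_dump())
    return reply, ckpt, time.monotonic() - start

reply, ckpt, elapsed = asyncio.run(turn_boundary())
# Total time is ~max(0.20, 0.05) s, not the 0.25 s sum: the dump is hidden.
print(f"{reply!r}, {ckpt!r}, elapsed ~ {elapsed:.2f}s")
```

When the dump fits inside the LLM wait window, the exposed checkpoint delay is zero; the paper's stress test (shrinking the wait window to 0.2x-0.6x) probes exactly the regime where it no longer fits.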

What carries the argument

The eBPF-based inspector that classifies each turn's OS-visible effects as recovery-relevant or irrelevant, then drives selective checkpoint decisions at turn boundaries.
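
Reduced to its essence, the inspector is a per-event predicate over OS effects, aggregated per turn. A userspace sketch of such rules (the event schema and path heuristics are assumptions for illustration; the real inspector runs as eBPF programs in the kernel):

```python
# Userspace simulation of turn-level effect classification.
# The real rules live in eBPF programs attached to syscall tracepoints;
# the event fields and path prefixes here are illustrative assumptions.
TRANSIENT_PREFIXES = ("/tmp/", "/proc/", "/dev/shm/")

def is_recovery_relevant(event: dict) -> bool:
    kind = event["kind"]
    if kind == "file_write":
        # Writes to transient locations need no checkpoint.
        return not event["path"].startswith(TRANSIENT_PREFIXES)
    if kind == "proc_spawn":
        # A spawned process matters only if it outlives the turn.
        return event["outlives_turn"]
    return False  # reads and other side-effect-free events

turn_events = [
    {"kind": "file_read",  "path": "/workspace/app.py"},
    {"kind": "file_write", "path": "/tmp/pip-build.log"},
    {"kind": "file_write", "path": "/workspace/app.py"},
    {"kind": "proc_spawn", "outlives_turn": False},
]
# Checkpoint the turn iff any event is recovery-relevant.
needs_checkpoint = any(is_recovery_relevant(e) for e in turn_events)
print(needs_checkpoint)
```

Only the `/workspace` write is relevant here, so the turn is checkpointed; a turn containing only the other three events would be skipped entirely.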

Load-bearing premise

The eBPF inspector can accurately and cheaply label every turn's OS effects as relevant or irrelevant for recovery without missing critical state or adding measurable overhead.

What would settle it

A workload in which an agent turn alters a file, process, or other OS state that the inspector labels irrelevant, yet that state proves necessary for correct restoration after a fault.

Figures

Figures reproduced from arXiv: 2604.28138 by Chaokun Chang, Lunxi Cao, Tianyuan Wu, Wei Gao, Wei Wang.

Figure 1
Figure 1: Recovery-state gap under four recovery strategies. Left: task solve rate. Right: CDF of per-task runtime. Slowdown is defined as median restart / no-failure solve time. view at source ↗
Figure 2
Figure 2: Agent turn-time distribution and checkpoint pressure at host scale. Left: CDF of agent turn time. Right: host-side checkpoint arrival RPS distribution vs sandbox density. view at source ↗
Figure 3
Figure 3: Left: ZFS snapshot overhead remains within tens of ms. Right: CRIU-based process checkpoint latency can grow to tens of seconds under high concurrency. Testbed: AWS c6id.32xlarge [1] with local NVMe SSDs. view at source ↗
Figure 5
Figure 5: Architecture overview of Crab; yellow blocks are key components of Crab. view at source ↗
Figure 6
Figure 6: Workflow timeline of Crab. view at source ↗
Figure 7
Figure 7: Examples of net filesystem / process changes. view at source ↗
Figure 8
Figure 8: The Checkpoint Manager maintains versioned recoverable manifests over partial filesystem and process checkpoints (left), and enforces transactional checkpoint publication through lifecycle tracking (right). view at source ↗
Figure 11
Figure 11: LLM and tool execution duration distributions. view at source ↗
Figure 10
Figure 10: Task-category composition of the sampled SWE-Bench and Terminal-Bench workloads. view at source ↗
Figure 12
Figure 12: Recovery correctness under sandbox crashes.
  Proc. Change: exact 4.8% (99), detected 4.8% (99), accuracy 100.0%, FPR 0.0%, FNR 0.0%
  FS Change: exact 27.9% (575), detected 29.5% (609), accuracy 98.3%, FPR 2.3%, FNR 0.0%
view at source ↗
Figure 14
Figure 14: Per-turn Coordinator overhead. Even at 96-way co-location, the Coordinator adds only tens of microseconds per turn, 4–5 orders of magnitude smaller than turn latency. view at source ↗
Figure 15
Figure 15: End-to-end performance comparison across benchmarks and deployment densities under a crash-recovery scenario for Claude-code, iFlow-cli, and SWE-agent. Crab remains within 1.9% of the no-fault, checkpoint-free optimal. view at source ↗
Figure 16
Figure 16: CDF of the Inspector's per-turn overhead across deployment densities for agent-in/with-a-sandbox. Median overhead is 18/16 µs for agent-in/with-a-sandbox deployments; p95 is 40/27 µs. view at source ↗
Figure 18
Figure 18: Effectiveness of asynchronous checkpointing and reactive scheduling. Left: task-level exposed delay under different sandbox densities. Right: task-level exposed delay under reduced LLM wait windows and different scheduling policies. The stress test scales the original LLM wait window to 0.2×/0.4×/0.6× of its original distribution at density 96. view at source ↗
Figure 19
Figure 19: Proactive-rollback case studies: baseline vs. an agent equipped with a rollback() tool. A: QEMU startup, 434 s → 307 s (−29%); RB2/RB4 stalls (orange) can be avoided. B: financial document classification, −3% wall clock but −36% rollback tokens. view at source ↗
Figure 21
Figure 21: Speculative execution on SWE-Bench tasks. Left: per-task wall clock with and without speculation (observed). Right: per-task penalty CDF (extra time the agent stalls due to rejected drafts). view at source ↗
read the original abstract

Autonomous agents act through sandboxed containers and microVMs whose state spans filesystems, processes, and runtime artifacts. Checkpoint and restore (C/R) of this state is needed for fault tolerance, spot execution, RL rollout branching, and safe rollback, yet existing approaches fall into two extremes: application-level recovery preserves chat history but misses OS-side effects, while full per-turn checkpointing is correct but too expensive under dense co-location. The root cause is an agent-OS semantic gap: agent frameworks see tool calls but not their OS effects; the OS sees state changes but lacks turn-level context to judge recovery relevance. This gap hides massive sparsity: over 75% of agent turns produce no recovery-relevant state, so most checkpoints are unnecessary. Crab (Checkpoint-and-Restore for Agent SandBoxes) is a transparent host-side runtime that bridges this gap without modifying agents or C/R backends. An eBPF-based inspector classifies each turn's OS-visible effects to decide checkpoint granularity; a coordinator aligns checkpoints with turn boundaries and overlaps C/R with LLM wait time; and a host-scoped engine schedules checkpoint traffic across co-located sandboxes. On shell-intensive and code-repair workloads, Crab raises recovery correctness from 8% (chat-only) to 100%, cuts checkpoint traffic by up to 87%, and stays within 1.9% of fault-free execution time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Crab, a transparent host-side runtime for semantics-aware checkpoint/restore in agent sandboxes. It bridges the agent-OS semantic gap via an eBPF-based inspector that classifies each turn's OS-visible effects (files, processes, artifacts) as recovery-relevant or irrelevant, a coordinator that aligns checkpoints with turn boundaries and overlaps C/R with LLM wait time, and a host-scoped engine for scheduling across co-located sandboxes. The central claim is that this exploits >75% sparsity in relevant turns to achieve 100% recovery correctness (vs. 8% for chat-only), up to 87% reduction in checkpoint traffic, and overhead within 1.9% of fault-free execution on shell-intensive and code-repair workloads, without modifying agents or C/R backends.

Significance. If the eBPF classification is shown to be accurate (no false negatives on recoverable state) and low-overhead, Crab would address a practical gap in fault tolerance, spot execution, and RL branching for containerized agents by avoiding unnecessary full checkpoints while preserving correctness. The approach is novel in its turn-level semantic filtering and could scale to dense co-location scenarios; the empirical gains on the reported workloads, if reproducible with proper controls, would be a useful contribution to OS support for agent systems.

major comments (3)
  1. [§3.2] §3.2 (eBPF Inspector): The claim of 100% recovery correctness rests on the inspector never producing false negatives when classifying recovery-relevant state mutations. No validation against ground-truth state diffs (e.g., full filesystem/process diffs per turn) or false-negative rate analysis is described; without this, it is impossible to confirm that the 75% sparsity figure reflects measured rather than assumed irrelevance.
  2. [§5] §5 (Evaluation): Concrete performance numbers (100% correctness, 87% traffic reduction, 1.9% overhead, 75% sparsity) are reported for two workloads without error bars, precise workload definitions, baseline source code or configurations, or micro-benchmarks isolating inspector overhead. This makes it difficult to assess whether the data support the central claims or whether the eBPF component adds measurable cost.
  3. [§4] §4 (Coordinator and Relevance Definition): The notion of 'recovery-relevant' state is used to justify sparsity and checkpoint decisions but lacks a formal definition or enumeration of what constitutes recoverable artifacts (e.g., specific file types, process trees, or runtime state). This is load-bearing for both the correctness and traffic-reduction results.
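
The ground-truth validation demanded in major comment 1 reduces to a confusion-matrix computation over per-turn labels: take the inspector's relevant/irrelevant labels, take a ground-truth label derived from full filesystem/process diffs, and count false negatives. A sketch with synthetic labels (in a real validation the ground truth would come from snapshot diffs, not a hand-written list):

```python
# False-negative-rate check for the inspector against ground-truth diffs.
# `inspector` and `ground_truth` are synthetic per-turn labels; the real
# ground truth would come from full filesystem and /proc state diffs.
inspector    = [False, True, False, False, True, False, True, False]
ground_truth = [False, True, False, False, True, False, True, True]

# A false negative is a turn the diffs say is relevant but the inspector missed.
fn = sum(gt and not pred for pred, gt in zip(inspector, ground_truth))
tp = sum(gt and pred for pred, gt in zip(inspector, ground_truth))
fnr = fn / (fn + tp) if (fn + tp) else 0.0

print(f"false negatives: {fn}, FNR: {fnr:.1%}")
```

Any nonzero FNR on relevant turns directly contradicts a 100% recovery-correctness claim, which is why the referee treats this measurement as load-bearing.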
minor comments (2)
  1. [Abstract] Abstract and §1: The 75% sparsity figure is stated as a fact but should be explicitly tied to a measurement section or figure so readers can trace its origin.
  2. [§5] Figures in §5: Ensure checkpoint traffic and correctness plots include legends, axis units, and comparison baselines (chat-only and full C/R) for immediate readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments identify areas where additional rigor will strengthen the manuscript, and we will revise accordingly. We respond to each major comment below.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (eBPF Inspector): The claim of 100% recovery correctness rests on the inspector never producing false negatives when classifying recovery-relevant state mutations. No validation against ground-truth state diffs (e.g., full filesystem/process diffs per turn) or false-negative rate analysis is described; without this, it is impossible to confirm that the 75% sparsity figure reflects measured rather than assumed irrelevance.

    Authors: We acknowledge that the manuscript does not include an explicit empirical validation of the inspector against ground-truth state diffs. The design of the eBPF rules is intentionally conservative (any mutation that could affect subsequent turns is classified as relevant), but we agree this does not substitute for measured evidence. In the revised manuscript we will add a new subsection to §3.2 that reports a ground-truth comparison: for a representative sample of turns we collect full filesystem and process diffs (via snapshots and /proc enumeration) and measure the false-negative rate of the inspector. We will also report the per-workload sparsity as directly measured from these traces rather than as an aggregate claim. revision: yes

  2. Referee: [§5] §5 (Evaluation): Concrete performance numbers (100% correctness, 87% traffic reduction, 1.9% overhead, 75% sparsity) are reported for two workloads without error bars, precise workload definitions, baseline source code or configurations, or micro-benchmarks isolating inspector overhead. This makes it difficult to assess whether the data support the central claims or whether the eBPF component adds measurable cost.

    Authors: We agree that the evaluation section would be improved by greater statistical detail and reproducibility information. In the revised §5 we will (1) add error bars derived from at least five independent runs per configuration, (2) provide precise workload definitions together with the exact agent prompts, container images, and baseline checkpointing scripts (including repository links), and (3) insert micro-benchmark results that isolate the eBPF inspector overhead by comparing runs with the inspector enabled versus disabled while holding all other components constant. revision: yes

  3. Referee: [§4] §4 (Coordinator and Relevance Definition): The notion of 'recovery-relevant' state is used to justify sparsity and checkpoint decisions but lacks a formal definition or enumeration of what constitutes recoverable artifacts (e.g., specific file types, process trees, or runtime state). This is load-bearing for both the correctness and traffic-reduction results.

    Authors: We accept that a formal definition is required. In the revised manuscript we will expand the opening of §4 to state: recovery-relevant state is any OS-visible artifact whose mutation persists past the current turn boundary and whose absence would prevent correct restoration to that boundary. We will then enumerate the four categories used by the inspector: (i) non-temporary file-system modifications, (ii) process creations that leave open resources or children, (iii) network sockets and external file descriptors, and (iv) selected runtime artifacts (environment variables, shared-memory segments, and open file handles). This definition directly drives both the sparsity measurement and the coordinator's checkpoint decisions. revision: yes
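
The definition proposed in this response lends itself to a direct enumeration. A sketch encoding the four promised categories as a tagged check (the category names follow the rebuttal; the artifact schema and field names are hypothetical):

```python
from enum import Enum
from typing import Optional

class Category(Enum):
    FS_MOD = "non-temporary filesystem modification"
    PROC = "process leaving open resources or children"
    NET = "network socket / external file descriptor"
    RUNTIME = "env vars, shared-memory segments, open handles"

def classify(artifact: dict) -> Optional[Category]:
    """Map an OS-visible artifact to its relevance category (None = irrelevant)."""
    if artifact["type"] == "file":
        # Temporary files do not persist past the turn boundary.
        return None if artifact.get("temporary", False) else Category.FS_MOD
    if artifact["type"] == "process":
        if artifact.get("children") or artifact.get("open_fds"):
            return Category.PROC
        return None  # exited cleanly, left nothing behind
    if artifact["type"] == "socket":
        return Category.NET
    if artifact["type"] in {"env", "shm", "handle"}:
        return Category.RUNTIME
    return None

print(classify({"type": "file", "temporary": True}))  # scratch file: irrelevant
print(classify({"type": "file"}))                     # persists past the turn
```

A turn is then checkpointed iff at least one of its artifacts classifies to a non-None category, which is exactly what ties the definition to both the sparsity measurement and the coordinator's decisions.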

Circularity Check

0 steps flagged

No circularity: empirical system results rest on direct measurements, not self-referential derivations

full rationale

The paper describes an implemented runtime (eBPF inspector, coordinator, host-scoped engine) and reports measured outcomes on workloads: recovery correctness rising from 8% to 100%, checkpoint traffic cut by up to 87%, and overhead bounded at 1.9%. These quantities are obtained by running the described components and observing end-to-end behavior; they are not obtained by fitting parameters to a subset of the same data and then re-deriving the same quantities, nor by any equation that takes the target metric as an input. No self-definitional relations, fitted-input predictions, or load-bearing self-citations that collapse the central argument appear in the abstract or the described approach. The 75% sparsity observation is presented as an empirical finding that motivates the design rather than an assumption smuggled into the evaluation. The eBPF classification accuracy is a correctness assumption whose verification is external to any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or ad-hoc axioms are stated. The approach implicitly relies on the domain assumption that eBPF can observe all relevant OS effects transparently.

axioms (2)
  • domain assumption eBPF provides complete, low-overhead visibility into OS state changes caused by agent turns without requiring agent or sandbox modifications
    This assumption underpins the transparent inspector and is required for the claimed correctness and overhead numbers.
  • domain assumption Agent turns can be reliably aligned with OS-visible effects at the host level
    Required for the coordinator to decide checkpoint granularity without false negatives.

pith-pipeline@v0.9.0 · 5555 in / 1461 out tokens · 42840 ms · 2026-05-07T05:53:00.109464+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 20 canonical work pages · 3 internal anchors

  1. [1]

    Amazon Web Services. 2026. Amazon EC2 C6id Instances. https://aws.amazon.com/ec2/instance-types/c6i/. Accessed 2026-03-26.

  2. [2]

    Amazon Web Services. 2026. Improving startup performance with AWS Lambda SnapStart. https://docs.aws.amazon.com/lambda/latest/dg/snapstart.html. Accessed 2026-03-26.

  3. [3]

    Amazon Web Services. 2026. Spot Instance interruption notices (Amazon EC2 User Guide). https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html. Documents that interruption notices are issued two minutes before interruption. Accessed 2026-03-16.

  4. [4]

    Jason Ansel, Kapil Arya, and Gene Cooperman. 2009. DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, Rome, Italy, 12 pages. https://people.csail.mit.edu/jansel/papers/2009ipdps-dmtcp.pdf

  5. [5]

    Anthropic. 2025. Multi-Agent Research System. https://www.anthropic.com/engineering/multi-agent-research-system. Accessed 2026-04-01.

  6. [6]

    Anthropic. 2026. Checkpointing (Claude Code Docs). https://code.claude.com/docs/en/checkpointing. Accessed 2026-03-16.

  7. [7]

    Anthropic. 2026. Claude 4.6. https://www.anthropic.com/claude. Accessed 2026-03-26.

  8. [8]

    Anthropic. 2026. Claude Code Overview (Claude Code Docs). https://code.claude.com/docs/en/overview. Accessed 2026-03-16.

  9. [9]

    Lixiang Ao, George Porter, and Geoffrey M. Voelker. 2022. FaaSnap: FaaS made fast using snapshot-based VMs. In Proceedings of the Seventeenth European Conference on Computer Systems. ACM, Rennes, France, 730–746.

  10. [10]

    Leonardo Bautista-Gomez, Seiji Tsuboi, Dimitri Komatitsch, Franck Cappello, Naoya Maruyama, and Satoshi Matsuoka. 2011. FTI: High performance fault tolerance interface for hybrid systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, Seattle, WA, USA, 1–32.

  11. [11]

    James Cadden, Thomas Unger, Yara Awad, Han Dong, Orran Krieger, and Jonathan Appavoo. 2020. SEUSS: Skip Redundant Paths to Make Serverless Fast. In European Conference on Computer Systems (EuroSys). ACM, New York, NY, USA, 1–15. doi:10.1145/3342195.3392698

  12. [12]

    Yang Chen. 2015. Checkpoint and restore of micro-service in docker containers. In 2015 3rd International Conference on Mechatronics and Industrial Informatics (ICMII 2015). Atlantis Press, Zhuhai, China, 915–918.

  13. [13]

    CRIU Project. 2025. CRIU: Checkpoint/Restore In Userspace (Main Page). https://criu.org/Main_Page. Describes freezing containers/apps and checkpointing state to disk. Accessed 2026-03-16.

  14. [14]

    Xinhao Deng, Yixiang Zhang, Jiaqing Wu, Jiaqi Bai, Sibo Yi, Zhuoheng Zou, Yue Xiao, Rennai Qiu, Jianan Ma, Jialuo Chen, Xiaohu Du, Xiaofang Yang, Shiwen Cui, Changhua Meng, Weiqiang Wang, Jiaxing Song, Ke Xu, and Qi Li. 2026. Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats. arXiv:2603.11619 [cs.CR]. https://arxiv.org/a...

  15. [15]

    Dong Du, Tianyi Yu, Yubin Xia, Binyu Zang, Guanglu Yan, Chenggang Qin, Qixuan Wu, and Haibo Chen. 2020. Catalyzer: Sub-millisecond Startup for Serverless Computing with Initialization-less Booting. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, Lausanne, Switzerland, 467–481. doi:10.11...

  16. [16]

    E2B. 2026. E2B Documentation. https://e2b.dev/docs. Describes isolated sandboxes for agents to execute code, process data, and run tools. Accessed 2026-03-16.

  17. [17]

    E2B. 2026. Sandbox Snapshots. https://e2b.dev/docs/sandbox/snapshots. Accessed 2026-03-17.

  18. [18]

    Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. 2022. Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI). USENIX Association, Rento...

  19. [19]

    Firecracker. 2026. Firecracker Snapshotting. https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/snapshot-support.md. Accessed 2026-03-17.

  20. [20]

    Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, and Wei Wang. 2025. RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure. arXiv:2512.22560 [cs.DC]. https://arxiv.org/abs/2512.22560

  21. [21]

    Paul H. Hargrove and Jason C. Duell. 2006. Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. Journal of Physics: Conference Series 46, 1 (2006), 494–499.

  22. [22]

    Jialiang Huang, Mingxing Zhang, Teng Ma, Zheng Liu, Sixing Lin, Kang Chen, Jinlei Jiang, Xia Liao, Yingdi Shan, Ning Zhang, Mengting Lu, Tao Ma, Haifeng Gong, and Yongwei Wu. 2024. TrEnv: Transparently Share Serverless Execution Environments Across Different Functions and Nodes. In ACM SIGOPS Symposium on Operating Systems Principles (SOSP). ACM, Hilto...

  23. [23]

    iFlow-cli. 2026. iFlow-cli. https://cli.iflow.cn/. Accessed 2026-03-26.

  24. [24]

    illumos Project. 2026. Working With ZFS Snapshots and Clones (ZFS Administration Guide). https://www.illumos.org/books/zfs-admin/snapshots.html. Defines ZFS snapshots as read-only, near-instant, copy-on-write point-in-time filesystem copies. Accessed 2026-03-16.

  25. [25]

    Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, and Liaoni Wu. 2026. Tree Search for LLM Agent Reinforcement Learning. arXiv:2509.21240 [cs.LG]. https://arxiv.org/abs/2509.21240

  26. [26]

    LangChain. 2026. Persistence (LangGraph Docs). https://docs.langchain.com/oss/python/langgraph/persistence. Describes checkpointing of graph state at every step, threads, and time-travel debugging. Accessed 2026-03-16.

  27. [27]

    Linux Kernel Developers. 2026. Control Groups (cgroups) v2 (Linux Kernel Documentation). https://docs.kernel.org/admin-guide/cgroup-v2.html. Describes the unified cgroup hierarchy for resource control and process grouping. Accessed 2026-03-16.

  28. [28]

    Linux Kernel Developers. 2026. eBPF Syscall (Linux Kernel Documentation). https://docs.kernel.org/userspace-api/ebpf/syscall.html. Reference for loading/attaching eBPF programs (including tracing/cgroup attachment points). Accessed 2026-03-16.

  29. [29]

    Linux Kernel Developers. 2026. Soft-Dirty PTEs (Linux Kernel Documentation). https://www.kernel.org/doc/html/v5.4/admin-guide/mm/soft-dirty.html. Describes soft-dirty tracking. Accessed 2026-03-16.

  30. [30]

    LlamaIndex. 2026. Maintaining state (LlamaIndex Docs). https://developers.llamaindex.ai/python/framework/understanding/agent/state/. Explains that AgentWorkflow is stateless by default and uses Context to maintain state across runs. Accessed 2026-03-16.

  31. [31]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, et al. 2026. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces. arXiv:2601.11868 [cs.AI]

  32. [32]

    Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia. 2024. SpotServe: Serving Generative Large Language Models on Preemptible Instances. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (La Jolla, CA, USA) (ASPLOS '24). Association...

  33. [33]

    Microsoft. 2026. Managing State (AutoGen Docs). https://microsoft.github.io/autogen/stable/user-guide/agentchat-user-guide/tutorial/state.html. Official tutorial on saving/loading agent/team state. Accessed 2026-03-16.

  34. [34]

    MiniMax. 2026. MiniMax M2.7. https://www.minimaxi.com/models/text/m27. Accessed 2026-03-26.

  35. [35]

    Jayashree Mohan, Amar Phanishayee, and Vijay Chidambaram. 2021. CheckFreq: Frequent, Fine-Grained DNN Checkpointing. In 19th USENIX Conference on File and Storage Technologies (FAST). USENIX Association, Virtual Event, 203–216. https://www.usenix.org/conference/fast21/presentation/mohan

  36. [36]

    Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R. De Supinski. 2010. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In SC '10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, LA, USA, 1–11.

  37. [37]

    Bogdan Nicolae, Adam Moody, Gregory Kosinovsky, Kathryn Mohror, and Franck Cappello. 2021. VELOC: VEry Low Overhead Checkpointing in the Age of Exascale. arXiv:2103.02131. https://arxiv.org/abs/2103.02131

  38. [38]

    Open Container Initiative. 2026. runc Checkpoint/Restore Documentation. https://github.com/opencontainers/runc/blob/main/docs/checkpoint-restore.md. Documents runc checkpoint/restore integration with CRIU. Accessed 2026-03-16.

  39. [39]

    OpenAI. 2026. Codex Overview (OpenAI Codex Docs). https://openai.com/codex/. Accessed 2026-03-16.

  40. [40]

    OpenAI. 2026. Run Long-Horizon Tasks with Codex. https://developers.openai.com/blog/run-long-horizon-tasks-with-codex. Accessed 2026-04-01.

  41. [41]

    OpenClaw. 2026. Sandboxing (OpenClaw Docs). https://docs.openclaw.ai/gateway/sandboxing. Explains that gateway stays on host; tool execution can run in Docker sandbox; sandboxing reduces blast radius but is not a perfect boundary. Accessed 2026-03-16.

  42. [42]

    OpenHands. 2026. Docker Sandbox (OpenHands Docs). https://docs.openhands.dev/openhands/usage/sandboxes/docker. States that Docker sandbox runs the agent server inside a Docker container by default/recommended. Accessed 2026-03-16.

  43. [43]

    Ohad Rodeh, Josef Bacik, and Chris Mason. 2013. BTRFS: The Linux B-Tree Filesystem. ACM Trans. Storage 9, 3, Article 9 (Aug. 2013), 32 pages. doi:10.1145/2501620.2501623

  44. [44]

    Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, and Ana Klimovic. 2024. DéjàVu: KV-cache streaming for fast, fault-tolerant generative LLM serving. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML '24). JMLR.org, Article 1902, 27 pages.

  45. [45]

    Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, and Edward Grefenstette. 2026. A Subgoal-driven Framework for Improving Long-Horizon LLM Agents. arXiv:2603.19685 [cs.AI]. https://arxiv.org/abs/2603.19685

  46. [46]

    Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, et al. 2025. Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem. arXiv:2512.24873 [cs.AI]. https://arxiv.org/abs/2512.24873

  47. [47]

    Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem. arXiv preprint arXiv:2512.24873, 2025. https://arxiv.org/abs/2512.24873

  48. [48]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform f...

  49. [49]

    Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, and Yida Wang. 2023. Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. In ACM SIGOPS Symposium on Operating Systems Principles (SOSP). ACM, Koblenz, Germany, 364–381. doi:10.1145/3600006.3613145

  50. [50]

    Fangnuo Wu, Mingkai Dong, Gequan Mo, and Haibo Chen. 2023. TreeSLS: A whole-system persistent microkernel with tree-structured state checkpoint on NVM. In Proceedings of the 29th Symposium on Operating Systems Principles. ACM, Koblenz, Germany, 1–16.

  51. [51]

    Li Wu, Walid A. Hanafy, Tarek Abdelzaher, David Irwin, Jesse Milzman, and Prashant Shenoy. 2025. FailLite: Failure-Resilient Model Serving for Resource-Constrained Edge Environments. arXiv:2504.15856 [cs.DC]. https://arxiv.org/abs/2504.15856

  52. [52]

    Wendong Xu, Chujie Chen, He Xiao, Kuan Li, Jing Xiong, Chen Zhang, Wenyong Zhou, Chaofan Tao, Yang Bai, Bei Yu, and Ngai Wong. 2025. AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism. arXiv:2511.11617 [cs.DC]. https://arxiv.org/abs/2511.11617

  53. [53]

    Ziyi Xu, Zhiqiang Xie, Swapnil Gandhi, and Christos Kozyrakis. 2025. FailSafe: High-performance Resilient Serving. arXiv:2511.14116 [cs.DC]. https://arxiv.org/abs/2511.14116

  54. [54]

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652.

  55. [55]

    Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, and Tianyi Peng. 2026. Speculative Actions: A Lossless Framework for Faster Agentic Systems. arXiv:2510.04371 [cs.AI] https://arxiv.org/abs/2510.04371

  56. [56]

    Songyu Zhang, Aaron Tam, Myungjin Lee, Shixiong Qi, and K. K. Ramakrishnan. 2026. Making MoE-based LLM Inference Resilient with Tarragon. arXiv:2601.01310 [cs.DC]. https://arxiv.org/abs/2601.01310

  57. [57]

    Yusheng Zheng, Jiakun Fan, Quanzhi Fu, Yiwei Yang, Wei Zhang, and Andi Quinn. 2026. AgentCgroup: Understanding and Controlling OS Resources of AI Agents. arXiv:2602.09345 [cs.OS]