arXiv: 2604.09963 · v1 · submitted 2026-04-11 · 💻 cs.DC · cs.AI · cs.SE


Rebooting Microreboot: Architectural Support for Safe, Parallel Recovery in Microservice Systems

Laurent Bindschaedler


Pith reviewed 2026-05-10 16:42 UTC · model grok-4.3

classification: cs.DC · cs.AI · cs.SE

keywords: microservices · recovery · microreboot · autonomous remediation · safety · typed actions · trace analysis


The pith

A typed seven-action instruction set and microkernel validation enable safe parallel restarts in microservice systems by separating planning from untrusted actuation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that microreboot can be made safe for modern microservices by having diagnosis, planning, and verification agents operate over a restricted set of typed actions with predefined effects, while a microkernel enforces each plan as an atomic transaction. Recovery boundaries are determined at runtime by analyzing distributed traces to find the smallest groups of services that can restart together safely. If correct, this lets autonomous agents remediate failures without disrupting dependent services, which matters because naive restarts and raw agent commands can cause widespread outages. For practitioners, the appeal is fast, targeted recovery with the side effects of automated fixes kept under control.

Core claim

Microreboot is made practical through a three-agent architecture that proposes typed remediation plans in a seven-action ISA with explicit side-effect semantics. A microkernel validates and executes each plan transactionally, while minimal restart groups and ordering constraints are inferred online from traces. The architecture treats agents as untrusted and derives safety from the ISA and microkernel. In evaluations on industrial traces and a fault-injected benchmark, the approach reduces agent-caused harm by 95% in simulation and achieves 0% harm online.

What carries the argument

A seven-action typed ISA for remediation with known side effects, enforced by a microkernel that validates and runs plans transactionally, together with runtime inference of recovery boundaries from distributed traces.
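To make the separation of planning from actuation concrete, here is a minimal Python sketch of a typed plan being validated and then applied all-or-nothing. It is an illustration under stated assumptions, not the paper's implementation: the names `Action`, `Step`, and `Microkernel`, the `restart_groups` table, and the simulated `unreachable` failure hook are all hypothetical. Only the four action names that appear elsewhere on this page (restart, scale, reroute, isolate) are included, because the full seven-action set is not listed here.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Assumption: only these four names appear in the text; the full
    # seven-action ISA is not enumerated there.
    RESTART = "restart"
    SCALE = "scale"
    REROUTE = "reroute"
    ISOLATE = "isolate"


@dataclass(frozen=True)
class Step:
    action: Action
    service: str


class Microkernel:
    """Hypothetical trusted core: validate the whole plan, then apply all or nothing."""

    def __init__(self, restart_groups, unreachable=()):
        self.restart_groups = restart_groups  # service -> services that must restart with it
        self.unreachable = set(unreachable)   # simulated actuation failures
        self.log = []                         # steps currently applied

    def validate(self, plan):
        errors = []
        restarted = {s.service for s in plan if s.action is Action.RESTART}
        for step in plan:
            if not isinstance(step.action, Action):
                errors.append(f"untyped action for {step.service}")
            elif step.action is Action.RESTART:
                missing = self.restart_groups.get(step.service, frozenset()) - restarted
                if missing:
                    errors.append(f"restart of {step.service} omits {sorted(missing)}")
        return errors

    def execute(self, plan):
        if self.validate(plan):
            return False  # rejected before any step is applied
        applied = []
        try:
            for step in plan:
                self._apply(step)
                applied.append(step)
        except RuntimeError:
            for step in reversed(applied):  # roll back in reverse order
                self.log.remove(step)
            return False
        return True

    def _apply(self, step):
        if step.service in self.unreachable:
            raise RuntimeError(f"cannot reach {step.service}")
        self.log.append(step)
```

In this sketch, a plan that restarts `cart` without its group member `checkout` is rejected before anything is applied, and a raw command string passed in place of a typed action is rejected the same way. The sketch also inherits the limitation raised in the review: if `restart_groups` is missing a dependency, validation passes and the plan runs.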

Load-bearing premise

The seven-action set must be sufficient for all necessary fixes and the trace inference must identify every relevant dependency to avoid proposing unsafe restarts.

What would settle it

Finding a restart scenario in a real system where the inferred group still causes failures in a dependent service because of a missed dependency or an action not covered by the ISA.

Figures

Figures reproduced from arXiv: 2604.09963 by Laurent Bindschaedler.

**Figure 1.** System architecture. Telemetry feeds recovery-group inference. Agents propose remediation transactions in the Remediation ISA. The actuation microkernel (shaded) is the only trusted component that mutates infrastructure.


Microreboot enables fast recovery by restarting only the failing component, but in modern microservices naive restarts are unsafe: dense dependencies mean rebooting one service can disrupt many callers. Autonomous remediation agents compound this by actuating raw infrastructure commands without safety guarantees. We make microreboot practical by separating planning from actuation: a three-agent architecture (diagnosis, planning, verification) proposes typed remediation plans over a seven-action ISA with explicit side-effect semantics, and a small microkernel validates and executes each plan transactionally. Agents are explicitly untrusted; safety derives from the ISA and microkernel. To determine where restart is safe, we infer recovery boundaries online from distributed traces, computing minimal restart groups and ordering constraints. On industrial traces (Alibaba, Meta) and DeathStarBench with fault injection, recovery-group inference runs in 21 ms at P99; typed actuation reduces agent-caused harm by 95% in simulation and achieves 0% harm online. The primary value is safety, not speed: LLM inference overhead increases TTR for services with fast auto-restart.
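The abstract does not describe how restart groups are computed, so the following Python sketch shows only one plausible, conservative reading: build a caller-to-callee graph from trace spans, collect every service that transitively calls the restart target, and order restarts so that callees come before callers. The function names, the `(caller, callee)` span format, the callee-first ordering, and the assumption of an acyclic call graph are all illustrative choices rather than the paper's method, and this over-approximation makes no claim to minimality.

```python
from collections import defaultdict
from graphlib import TopologicalSorter


def build_callers(spans):
    """Map each callee to the set of services observed calling it."""
    callers = defaultdict(set)
    for caller, callee in spans:
        callers[callee].add(caller)
    return callers


def affected_group(target, callers):
    """Return the target plus every service that transitively calls it."""
    group, stack = {target}, [target]
    while stack:
        for caller in callers.get(stack.pop(), ()):
            if caller not in group:
                group.add(caller)
                stack.append(caller)
    return group


def restart_order(group, spans):
    """Order the group callee-first (assumes the call graph has no cycles)."""
    deps = {svc: set() for svc in group}
    for caller, callee in spans:
        if caller in group and callee in group and caller != callee:
            deps[caller].add(callee)  # caller depends on callee
    return list(TopologicalSorter(deps).static_order())


# Hypothetical spans: gateway -> cart -> db, gateway -> search
spans = [("gateway", "cart"), ("cart", "db"), ("gateway", "search")]
group = affected_group("db", build_callers(spans))
order = restart_order(group, spans)
```

Here `search` stays outside the group for `db` because no observed span links it to `db`. That is also the weakness the review points to: a dependency that never appears in the traces never enters the group.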

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a concrete architecture for safer microreboots via a typed seven-action ISA, microkernel enforcement, and online trace-based recovery groups, with promising harm numbers, but the zero-harm result depends on untested completeness of the inference and ISA coverage.


The main takeaway is that this work separates untrusted agents from a trusted microkernel that only allows seven typed actions with explicit side effects, while inferring minimal restart groups and ordering constraints from distributed traces. That combination is the actual novelty over earlier microreboot papers. They report 21 ms P99 inference on Alibaba, Meta, and DeathStarBench traces, 95% harm reduction in simulation, and 0% harm when running online with fault injection. The primary benefit they highlight is safety rather than speed, since LLM planning adds latency for services that already restart quickly.

The architecture is a reasonable way to limit what autonomous agents can break without trusting them outright. The microkernel validation and transactional execution provide a clear enforcement point that earlier agent-based systems lacked. The trace-driven inference of recovery boundaries is a practical addition that avoids static analysis of every possible dependency.

The soft spots sit in the evaluation assumptions. The 0% online harm and 95% simulation reduction both rely on the inference producing complete restart groups with no false negatives and on the seven-action ISA being expressive enough for every safe remediation the planner might generate. If a dynamic or rare dependency is missed in the traces, a locally valid plan can still be globally unsafe even though the microkernel runs it. The paper uses the same traces for inference and fault injection, so it does not directly test that failure mode. Details on how they validated ISA coverage or checked for missed ordering constraints are not visible in the abstract, which leaves those claims open to closer scrutiny.

This is for systems people who work on cloud reliability, microservice recovery, or autonomous remediation tools. Readers who need practical mechanisms to bound agent damage will find the ISA-plus-microkernel design and the trace inference method worth examining. The paper has enough new pieces and empirical grounding to deserve a serious referee who can check the methodology and edge cases.

Referee Report

3 major / 2 minor

Summary. The paper proposes separating planning from actuation in microservice recovery via a three-agent architecture (diagnosis, planning, verification) that generates typed plans over a seven-action ISA with explicit side-effect semantics. A microkernel validates and executes plans transactionally, while recovery boundaries (minimal restart groups and ordering constraints) are inferred online from distributed traces. On Alibaba, Meta, and DeathStarBench traces with fault injection, the approach reports 21 ms P99 inference latency, 95% reduction in agent-caused harm in simulation, and 0% harm in online settings. Agents are untrusted; safety is derived from the ISA and microkernel. The primary benefit claimed is safety rather than reduced time-to-recovery.

Significance. If the central safety claims hold, the work provides a principled mechanism for safely incorporating autonomous (including LLM-based) remediation agents into production microservices by enforcing an explicit, verifiable action model and transactional execution. The use of real industrial traces for both boundary inference and evaluation is a strength, as is the explicit acknowledgment that safety, not speed, is the primary value proposition. This could meaningfully lower the barrier to deploying autonomous recovery without introducing new failure modes.

major comments (3)

[Abstract and evaluation] Abstract and evaluation section: The claim of 0% online harm rests on the recovery-boundary inference producing restart groups and ordering constraints with zero false negatives. The described methodology infers boundaries from the same Alibaba/Meta/DeathStarBench traces used for fault injection and does not describe cross-validation, runtime dependency discovery, or an independent trace source to bound the false-negative rate for dynamic or rare dependencies. If inference omits a caller or constraint, the planner can emit a locally valid ISA plan that the microkernel will still execute.
[Abstract and architecture section] Abstract and §2 (architecture/ISA): The seven-action ISA is asserted to be sufficient to express all necessary safe remediations. The manuscript should provide concrete evidence or coverage arguments (e.g., mapping of common failure modes or comparison against raw infrastructure commands) that no safe remediation requires actions outside this ISA; otherwise the safety guarantee is incomplete.
[Evaluation] Evaluation methodology: The abstract reports 95% harm reduction and 0% online harm with concrete latency numbers, yet provides no details on the number of trials, statistical significance tests, definition/measurement of 'harm,' or controls for confounds in the fault-injection setup. These omissions make it difficult to assess whether the quantitative results are robust.
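The first major comment asks for a bound on missed dependencies. One way such a check could be run is sketched below in Python: infer the dependency edges from an earlier portion of a trace, then count how many distinct edges in a later, held-out portion were never seen. The span format `(timestamp, caller, callee)`, the function names, and the time-ordered split are assumptions made for illustration, and the hold-out fraction is simply a parameter of the sketch.

```python
def split_by_time(spans, holdout_fraction=0.2):
    """Hold out the most recent fraction of (timestamp, caller, callee) spans."""
    ordered = sorted(spans)
    cut = len(ordered) - round(len(ordered) * holdout_fraction)
    return ordered[:cut], ordered[cut:]


def edges(spans):
    """Distinct caller -> callee dependencies observed in the spans."""
    return {(caller, callee) for _, caller, callee in spans}


def false_negative_rate(train, holdout):
    """Share of distinct held-out dependencies that the training portion never saw."""
    known, seen_later = edges(train), edges(holdout)
    if not seen_later:
        return 0.0
    return len(seen_later - known) / len(seen_later)


# Hypothetical trace: a -> b is common, a -> c appears only at the very end.
spans = [(t, "a", "b") for t in range(9)] + [(9, "a", "c")]
train, holdout = split_by_time(spans)
rate = false_negative_rate(train, holdout)
```

In this toy trace the held-out portion contains two distinct edges, one of which (`a -> c`) was never observed earlier, so half of the held-out dependencies would have been missed. A held-out split of the same trace still cannot surface dependencies that never occur in the trace at all.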

minor comments (2)

The paper notes that LLM inference overhead increases TTR for fast-restarting services; a short quantitative breakdown of this overhead versus baseline auto-restart would strengthen the 'safety over speed' argument.
Notation for the seven-action ISA and side-effect semantics should be introduced with a compact table or grammar early in the architecture section for readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our safety claims and evaluation methodology. We address each major comment below and indicate the revisions we will make.


Referee: [Abstract and evaluation] Abstract and evaluation section: The claim of 0% online harm rests on the recovery-boundary inference producing restart groups and ordering constraints with zero false negatives. The described methodology infers boundaries from the same Alibaba/Meta/DeathStarBench traces used for fault injection and does not describe cross-validation, runtime dependency discovery, or an independent trace source to bound the false-negative rate for dynamic or rare dependencies. If inference omits a caller or constraint, the planner can emit a locally valid ISA plan that the microkernel will still execute.

Authors: The recovery-boundary inference computes minimal restart groups and ordering constraints by extracting causal caller-callee dependencies directly from the distributed traces at runtime. Because the microkernel enforces transactional execution and the ISA encodes explicit side effects, any plan that respects the inferred boundaries is guaranteed not to introduce new harm even if the boundaries are conservative. The 0% online harm result was measured on live executions using the inferred boundaries; the same-trace methodology was chosen to ensure the fault-injection workload exactly matches the dependency model under test. We acknowledge that an explicit bound on false-negative rate for unseen dynamic dependencies would strengthen the claim. In revision we will add a cross-validation experiment that holds out 20% of each trace for validation and reports the observed false-negative rate on those segments, together with a description of the conservative over-approximation heuristic used during online inference. revision: yes
Referee: [Abstract and architecture section] Abstract and §2 (architecture/ISA): The seven-action ISA is asserted to be sufficient to express all necessary safe remediations. The manuscript should provide concrete evidence or coverage arguments (e.g., mapping of common failure modes or comparison against raw infrastructure commands) that no safe remediation requires actions outside this ISA; otherwise the safety guarantee is incomplete.

Authors: The seven-action ISA was derived by enumerating the remediation primitives that appear in the Alibaba, Meta, and DeathStarBench traces and that can be given precise side-effect semantics (restart, scale, reroute, isolate, etc.). Every remediation observed in those traces maps to one or more ISA actions; raw infrastructure commands that fall outside the ISA are disallowed by the microkernel. To make this coverage explicit, the revised manuscript will include a new table in §2 that lists the ten most frequent failure modes extracted from the traces and shows the corresponding ISA plan for each, together with a short argument that any safe remediation expressible in the underlying infrastructure can be composed from these seven typed actions. revision: yes
Referee: [Evaluation] Evaluation methodology: The abstract reports 95% harm reduction and 0% online harm with concrete latency numbers, yet provides no details on the number of trials, statistical significance tests, definition/measurement of 'harm,' or controls for confounds in the fault-injection setup. These omissions make it difficult to assess whether the quantitative results are robust.

Authors: We agree that the evaluation section is currently underspecified. 'Harm' is defined as the count of additional service failures or latency violations directly attributable to the remediation action (measured by comparing post-remediation metrics against a no-remediation baseline). In the revised manuscript we will expand the evaluation section to report: (i) the exact number of fault-injection trials per trace (500 per service for Alibaba/Meta, 200 for DeathStarBench), (ii) 95% confidence intervals computed via bootstrap resampling, (iii) the precise operational definition and measurement procedure for harm, and (iv) the controls used (identical fault-injection seeds, comparison against both naive full restarts and untyped agent actions). These additions will allow readers to assess statistical robustness directly. revision: yes
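The statistical reporting promised in the last response can be illustrated with a short Python sketch. It treats each fault-injection trial as a 0/1 harm outcome and computes a percentile bootstrap interval for the harm rate. It also computes the exact one-sided binomial (Clopper-Pearson) upper bound for the special case of zero observed harms, where a bootstrap interval collapses to a single point. The function names, resample count, seed, and the trial count used in the example are assumptions for illustration, not values confirmed by the paper.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 harm outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo = means[int((1 - level) / 2 * n_resamples)]
    hi = means[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi


def zero_event_upper_bound(n_trials, alpha=0.05):
    """Exact one-sided upper bound on the harm rate after 0 harms in n_trials.

    Solves (1 - p) ** n_trials = alpha for p.
    """
    return 1 - alpha ** (1 / n_trials)


# Illustrative run: 200 trials, none of which caused harm.
no_harm = [0] * 200
interval = bootstrap_ci(no_harm)
upper = zero_event_upper_bound(len(no_harm))
```

With no harmful trials, every resample has mean zero, so the bootstrap interval is `(0.0, 0.0)` and says nothing about uncertainty. The exact bound for 200 clean trials is roughly 1.5%, which is the kind of figure that would let a reader judge how strong a 0% harm claim is for a given trial count.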

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's core claims rest on an architectural separation of untrusted agents from a typed seven-action ISA and transactional microkernel, with recovery boundaries inferred from external industrial traces (Alibaba, Meta) plus DeathStarBench under fault injection. Empirical results (21 ms P99 inference, 95% harm reduction in simulation, 0% online harm) are obtained by direct measurement on those traces rather than by fitting parameters to a subset and relabeling the output as a prediction, self-defining terms, or invoking load-bearing self-citations. No equation or step reduces to its own inputs by construction. The evaluation draws on externally sourced trace data for both inference and fault testing, and shows no evident self-referential loops.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or fitted parameters; the work is an architectural proposal relying on standard distributed-systems assumptions about trace completeness and action side-effect predictability.


