Recognition: no theorem link
Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
Pith reviewed 2026-05-11 00:42 UTC · model grok-4.3
The pith
Governed upgrades keep AI agent task success at 67.4% with zero unsafe activations
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper formulates governed capability evolution as a first-class software-lifecycle problem for AI-component-based systems and proposes a staged upgrade framework. Every new capability version receives four compatibility checks (interface, policy, behavioral, recovery), organized into a staged pipeline of candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, and rollback. On a PyBullet/ROS 2 testbed over 6 upgrade rounds with 15 random seeds, governed upgrades retain 67.4% task success with zero unsafe activations, while naive upgrades reach 72.9% success but drive unsafe activations to 60% by the final round. Shadow deployment detects 40% of regressions missed by sandbox evaluation alone, and rollback succeeds in 79.8% of post-activation drift scenarios.
What carries the argument
The staged upgrade framework, which applies the four compatibility checks (interface, policy, behavioral, recovery) to each new AI capability version through a sequence of candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, and rollback.
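The framework's own code is not reproduced on this page; the sketch below is a minimal, hypothetical Python rendering of how such a staged pipeline could be wired, to make the stage ordering concrete. Every name in it (CapabilityVersion, StagedUpgradePipeline, the per-stage callables) is an illustrative assumption, not the paper's API.

```python
# Illustrative sketch only: stage names follow the paper, but all classes,
# signatures, and gating logic here are assumptions, not the authors' code.
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List


class Verdict(Enum):
    PASS = auto()
    FAIL = auto()


@dataclass
class CapabilityVersion:
    name: str
    version: str
    artifact: object  # the installable capability module under evaluation


Check = Callable[[CapabilityVersion], Verdict]
Stage = Callable[[CapabilityVersion], Verdict]


@dataclass
class StagedUpgradePipeline:
    compatibility_checks: List[Check]   # interface, policy, behavioral, recovery
    sandbox_eval: Stage
    shadow_eval: Stage
    online_monitor: Stage
    activate: Callable[[CapabilityVersion], None]
    rollback: Callable[[CapabilityVersion], None]
    log: List[str] = field(default_factory=list)

    def run(self, candidate: CapabilityVersion) -> bool:
        # Candidate validation: all four compatibility checks must pass.
        if any(c(candidate) is Verdict.FAIL for c in self.compatibility_checks):
            self.log.append("rejected: compatibility check failed")
            return False
        # Sandbox evaluation against reference scenarios.
        if self.sandbox_eval(candidate) is Verdict.FAIL:
            self.log.append("rejected: sandbox regression")
            return False
        # Shadow deployment: the candidate runs alongside the active version
        # without controlling the system; its outputs are compared offline.
        if self.shadow_eval(candidate) is Verdict.FAIL:
            self.log.append("rejected: shadow regression")
            return False
        # Gated activation followed by online monitoring.
        self.activate(candidate)
        if self.online_monitor(candidate) is Verdict.FAIL:
            self.rollback(candidate)
            self.log.append("rolled back: post-activation drift")
            return False
        self.log.append("activated")
        return True
```

The point of the staging, as the paper describes it, is that each stage is a gate: a candidate never becomes the active version unless every earlier stage passed, and rollback is only reachable after activation.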
Load-bearing premise
The four compatibility checks are sufficient to detect all unsafe evolutions, and the PyBullet/ROS 2 testbed with random seeds adequately represents real-world embodied agent upgrade scenarios.
What would settle it
Observing unsafe activations in physical robot deployments after the checks pass, or a statistically significant drop in task success under governed upgrades, would falsify the framework's effectiveness.
Original abstract
Software systems built from versioned AI components increasingly need lifecycle-time governance: when a capability module evolves into a new version, the hosting system must decide whether the new version may be activated safely, under what deployment conditions, with what monitoring, and when it should be rolled back. Existing software-deployment patterns (canary, blue-green, feature flags, MLOps pipelines) address parts of this loop but were designed for stateless web services rather than stateful, policy-constrained runtimes that drive AI components in the field. We study this problem in the setting of embodied agents, where capabilities are packaged as installable modules under runtime policy and recovery constraints. We formulate governed capability evolution as a first-class software-lifecycle problem for AI-component-based systems and propose a staged upgrade framework that treats every new capability version as a governed deployment candidate rather than an immediate replacement. The framework introduces four compatibility checks (interface, policy, behavioral, recovery) and organizes them into a staged pipeline of candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, and rollback. A reference prototype on a PyBullet/ROS 2 testbed evaluated over 6 upgrade rounds with 15 random seeds shows naive upgrade reaches 72.9% task success but drives unsafe activation to 60% by the final round, while governed upgrade retains comparable success (67.4%) with zero unsafe activations across all rounds (Wilcoxon p=0.003). Shadow deployment surfaces 40% of regressions invisible to sandbox alone, and rollback succeeds in 79.8% of post-activation drift scenarios. The work extends runtime governance from action execution to capability evolution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a staged upgrade framework with four compatibility checks (interface, policy, behavioral, recovery) enables governed capability evolution for AI-component-based systems. Using embodied agents as a case study, it organizes the checks into a pipeline of candidate validation, sandbox evaluation, shadow deployment, gated activation, monitoring, and rollback. On a PyBullet/ROS 2 testbed over 6 upgrade rounds with 15 seeds, governed upgrades retain 67.4% task success with 0% unsafe activations (vs. 72.9% success but 60% unsafe for naive upgrades; Wilcoxon p=0.003), with shadow deployment surfacing 40% of regressions missed by sandbox evaluation alone and rollback succeeding in 79.8% of drift cases.
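For orientation on the cited statistic: a Wilcoxon signed-rank test over 15 paired per-seed outcomes would be computed roughly as below. The pairing of governed and naive success rates by seed is an assumption; the per-seed data are not published on this page.

```python
# Sketch of the paired comparison implied by "Wilcoxon p=0.003".
# Assumes governed/naive task-success rates are paired by random seed.
from typing import Sequence
from scipy.stats import wilcoxon


def compare_upgrade_policies(governed: Sequence[float],
                             naive: Sequence[float]) -> float:
    """Return the Wilcoxon signed-rank p-value over per-seed success rates."""
    _, p_value = wilcoxon(governed, naive)
    return p_value
```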
Significance. If the central result holds, the work provides a concrete lifecycle governance approach for versioned AI capabilities in policy-constrained, stateful systems, extending beyond stateless deployment patterns like canary releases. The empirical separation in unsafe rates, use of shadow deployment, and rollback metrics are strengths; the framework treats upgrades as first-class governed events rather than direct replacements.
major comments (3)
- [Evaluation section (abstract and §5)] The 0% unsafe activation rate for governed upgrades is defined and detected using the same four checks that the framework asserts will prevent unsafe evolutions. No independent oracle or ground-truth labeling of unsafe states (separate from the checks) is described, so the result demonstrates internal consistency within the testbed but does not independently confirm that all unsafe evolutions are caught.
- [Framework (§3) and experimental setup] The sufficiency of the four checks to detect unsafe evolutions is load-bearing for the claim of zero unsafe activations, yet the PyBullet/ROS 2 simulation omits real-world factors (sensor noise, unmodeled contact dynamics, hardware drift) that could produce policy-violating states passing the checks. The paper provides no discussion or additional validation of this assumption.
- [Abstract and §4] Implementation details for the four checks (how interface, policy, behavioral, and recovery are realized and validated in the testbed) are not provided, leaving the central empirical claim dependent on unshown mechanisms.
minor comments (2)
- [Abstract] Abstract contains a typo: 'whetmeher' should read 'whether'.
- [Throughout] Ensure consistent use of terms like 'unsafe activation' across sections and figures; clarify how task success is measured independently of the governance pipeline.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help clarify the scope and limitations of our evaluation. We have revised the manuscript to address each point by adding implementation details, an independent definition of unsafe states, and an expanded limitations discussion. Our responses to the major comments follow.
Point-by-point responses
-
Referee: The 0% unsafe activation rate for governed upgrades is defined and detected using the same four checks that the framework asserts will prevent unsafe evolutions. No independent oracle or ground-truth labeling of unsafe states (separate from the checks) is described, so the result demonstrates internal consistency within the testbed but does not independently confirm that all unsafe evolutions are caught.
Authors: We agree this is a valid concern and that the primary evaluation metric is tied to the checks themselves. In the revised §5 we now provide an independent definition of unsafe states based on post-activation runtime monitoring: any state in which the embodied agent violates the declared policy (e.g., obstacle collision) or exhibits a statistically significant drop in task success rate relative to the baseline, measured continuously and separately from the pre-activation pipeline. We added a new table comparing these independent post-activation indicators against the check outcomes for both governed and naive upgrades, showing alignment. We acknowledge that a fully external oracle (human labeling or physical testbed) is outside the current simulation study and have noted this limitation explicitly. revision: partial
-
Referee: The sufficiency of the four checks to detect unsafe evolutions is load-bearing for the claim of zero unsafe activations, yet the PyBullet/ROS 2 simulation omits real-world factors (sensor noise, unmodeled contact dynamics, hardware drift) that could produce policy-violating states passing the checks. The paper provides no discussion or additional validation of this assumption.
Authors: We accept the point that the simulation environment is idealized. The revised manuscript adds a new 'Limitations and Assumptions' subsection in §5 that explicitly discusses sensor noise, unmodeled dynamics, and hardware drift, explains why the current checks use conservative thresholds to provide margin, and states that the framework's claims are scoped to controlled simulation settings. We also outline planned physical-robot validation as future work. The core empirical comparison (governed vs. naive) remains valid within the reported testbed, but we no longer imply broader real-world sufficiency without further evidence. revision: yes
-
Referee: Implementation details for the four checks (how interface, policy, behavioral, and recovery are realized and validated in the testbed) are not provided, leaving the central empirical claim dependent on unshown mechanisms.
Authors: We apologize for the missing details. In the revised §4 we have added concrete implementation descriptions and pseudocode for each check as realized in the PyBullet/ROS 2 testbed: the interface check performs schema and API signature matching; the policy check invokes a runtime policy verifier against declared invariants; the behavioral check runs sandboxed trajectory comparison against reference behaviors with a distance threshold; and the recovery check validates rollback trigger conditions and state restoration. We also include a short validation subsection reporting per-check pass rates on the 15 seeds. revision: yes
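To make the response above more concrete, here is a condensed sketch of the shape each of the four checks could take. It is illustrative only: every function name, data structure, and threshold below is an assumption, not the revised §4.

```python
# Hypothetical shapes for the four compatibility checks; not the authors' code.
import math
from typing import Callable, Dict, List, Sequence

State = dict  # placeholder for an agent/runtime state snapshot


def interface_check(declared_api: Dict[str, str],
                    candidate_api: Dict[str, str]) -> bool:
    """Schema / API signature matching: every declared endpoint keeps its signature."""
    return all(candidate_api.get(name) == sig for name, sig in declared_api.items())


def policy_check(invariants: Sequence[Callable[[State], bool]],
                 sampled_states: Sequence[State]) -> bool:
    """Runtime policy verification: declared invariants hold on sampled states."""
    return all(inv(s) for inv in invariants for s in sampled_states)


def behavioral_check(candidate_traj: List[Sequence[float]],
                     reference_traj: List[Sequence[float]],
                     max_deviation: float) -> bool:
    """Sandboxed trajectory comparison against a reference behavior, with a
    maximum pointwise Euclidean deviation as the distance threshold."""
    deviations = [math.dist(a, b) for a, b in zip(candidate_traj, reference_traj)]
    return bool(deviations) and max(deviations) <= max_deviation


def recovery_check(rollback_trigger: Callable[[State], bool],
                   restore: Callable[[], State],
                   checkpoint: State,
                   injected_fault: State) -> bool:
    """Recovery validation: the rollback trigger fires on an injected fault and
    state restoration returns the runtime to the saved checkpoint."""
    return rollback_trigger(injected_fault) and restore() == checkpoint
```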
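The first response above also describes an independent, post-activation definition of unsafe states (declared-policy violations, or a significant drop in task success versus baseline). A rough illustration of such a monitor follows; the choice of significance test and the threshold are assumptions, not the revised manuscript's method.

```python
# Illustrative post-activation unsafe-state monitor, independent of the
# pre-activation checks. Test choice and alpha are assumptions for exposition.
from typing import Sequence
from scipy.stats import mannwhitneyu


def is_unsafe(policy_violations: int,
              baseline_success: Sequence[float],
              observed_success: Sequence[float],
              alpha: float = 0.05) -> bool:
    """Flag an unsafe post-activation state on either criterion from the rebuttal:
    (1) any declared-policy violation (e.g., an obstacle collision), or
    (2) a statistically significant drop in task success versus the baseline."""
    if policy_violations > 0:
        return True
    # One-sided test: is observed success stochastically lower than baseline?
    _, p = mannwhitneyu(observed_success, baseline_success, alternative="less")
    return p < alpha
```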
Circularity Check
No circularity in empirical framework evaluation
Full rationale
The paper proposes a staged upgrade framework with four compatibility checks and reports direct experimental outcomes (task success rates of 67.4% vs 72.9%, zero unsafe activations) from a PyBullet/ROS 2 simulation testbed across 6 upgrade rounds and 15 seeds. These metrics are measured observations in the environment and do not reduce to any fitted parameters, self-definitions, or predictions that loop back to the framework inputs by construction. No mathematical derivations, uniqueness theorems, or self-citation chains are load-bearing for the central claims; the evaluation stands as an independent empirical demonstration within the stated testbed.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Embodied agents can be adequately modeled and tested in PyBullet/ROS 2 for upgrade safety evaluation.
Forward citations
Cited by 3 Pith papers
-
EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.
-
Federated Single-Agent Robotics: Multi-Robot Coordination Without Intra-Robot Multi-Agent Fragmentation
Multi-robot coordination is achieved by federating single-agent robot runtimes at the fleet level instead of fragmenting each robot into multiple internal agents.
-
ECM Contracts: Contract-Aware, Versioned, and Governable Capability Interfaces for Embodied Agents
ECM Contracts define a six-dimensional contract model for embodied capability modules that enables static checks for safe composition, installation, and versioned upgrades in robotics systems.