Governance by Construction for Generalist Agents

Alon Oved; Avi Yaeli; Harold Ship; Ido Levy; Iftach Shoham; Nir Mashkif; Offer Akrabi; Sami Marreed; Segev Shlomov; Sergey Zeltyn

arxiv: 2605.20874 · v1 · pith:7G326TAOnew · submitted 2026-05-20 · 💻 cs.AI · cs.SE

Pith reviewed 2026-05-21 04:48 UTC · model grok-4.3

keywords: LLM agents, policy as code, governance, enterprise AI, compliance, human-in-the-loop, agent workflows

The pith

A modular policy-as-code layer steers generalist LLM agents through five checkpoints to enforce compliance without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that governance can be built directly into the execution pipeline of a generalist agent by intercepting it at fixed structural points rather than retraining or rebuilding it for each new domain. These interventions occur upstream of planning, inside the reasoning prompt, at tool boundaries, as human approval gates, and on final outputs. A reader would care because this setup promises to make autonomous agents usable in regulated settings such as healthcare while keeping behavior predictable and auditable across compound workflows.

Core claim

The CUGA policy system composes with a generalist LLM agent to deliver predictable, auditable, and compliance-aware behavior in compound workflows without model fine-tuning by intercepting the agent at five structural checkpoints: upstream of planning (Intent Guard), within the system prompt to steer reasoning (Playbook), at the tool-call boundary to enforce proper usage (Tool Guide), outside the reasoning loop as a Human-in-the-Loop gate for high-risk actions (Tool Approvals), and at the output stage to filter and structure the final response (Output Formatter).

What carries the argument

Five structural checkpoints (Intent Guard, Playbook, Tool Guide, Tool Approvals, Output Formatter) that perform policy interventions at successive stages of the agent's execution pipeline.
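
The ordering of the five checkpoints can be made concrete with a small sketch. This is a minimal illustration under stated assumptions, not CUGA's implementation: every name here (`Policy`, `Trace`, `ToolCall`, `run_governed`) is hypothetical, the planner is an injected stub standing in for the LLM, the Intent Guard is reduced to phrase matching, and the Tool Guide is reduced to an allow-list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToolCall:
    name: str
    args: Dict[str, str]


@dataclass
class Policy:
    # Hypothetical policy shape; the paper does not specify a schema.
    blocked_phrases: List[str]    # consumed by the Intent Guard
    playbook: str                 # text injected into the prompt
    allowed_tools: List[str]      # consumed by the Tool Guide
    approval_required: List[str]  # consumed by Tool Approvals
    redact_fields: List[str]      # consumed by the Output Formatter


@dataclass
class Trace:
    events: List[str] = field(default_factory=list)

    def log(self, checkpoint: str, detail: str) -> None:
        self.events.append(f"{checkpoint}: {detail}")


def run_governed(
    request: str,
    policy: Policy,
    plan: Callable[[str], List[ToolCall]],
    tools: Dict[str, Callable[[Dict[str, str]], Dict[str, str]]],
    approve: Callable[[ToolCall], bool],
    trace: Trace,
) -> Dict[str, str]:
    # 1. Intent Guard: runs before any planning happens.
    lowered = request.lower()
    for phrase in policy.blocked_phrases:
        if phrase in lowered:
            trace.log("intent_guard", f"blocked on '{phrase}'")
            return {"status": "blocked"}
    trace.log("intent_guard", "passed")

    # 2. Playbook: steer reasoning by prepending policy text to the prompt.
    prompt = f"{policy.playbook}\n\nUser request: {request}"
    trace.log("playbook", "injected")
    calls = plan(prompt)  # stand-in for the agent's reasoning loop

    results: Dict[str, str] = {}
    for call in calls:
        # 3. Tool Guide: checked at the tool-call boundary.
        if call.name not in policy.allowed_tools:
            trace.log("tool_guide", f"rejected {call.name}")
            continue
        # 4. Tool Approvals: human gate, decided outside the reasoning loop.
        if call.name in policy.approval_required and not approve(call):
            trace.log("tool_approvals", f"denied {call.name}")
            continue
        results.update(tools[call.name](call.args))
        trace.log("tool_call", f"executed {call.name}")

    # 5. Output Formatter: filter the final response.
    output = {k: v for k, v in results.items() if k not in policy.redact_fields}
    trace.log("output_formatter", f"redacted {len(results) - len(output)} field(s)")
    output["status"] = "ok"
    return output
```

In this sketch only the Playbook step depends on the model following instructions; the other four checks run in ordinary code and write to the trace whether or not the planner cooperates.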

If this is right

Agents can be deployed in new regulated domains by updating only the policy definitions rather than retraining the base model.
High-risk actions are automatically routed through human approval without altering the agent's core reasoning loop.
Policy adherence and execution traces become continuously auditable because interventions are explicit and logged at each checkpoint.
Different enterprise workflows can reuse the same generalist agent by swapping modular policy sets at runtime.
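
As a rough illustration of the last point, swapping policy sets at runtime can be as simple as a registry keyed by domain. The sketch below is an assumption about how such a registry might look; `PolicySet`, `PolicyRegistry`, and the tool names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class PolicySet:
    # Hypothetical, immutable policy definition for one domain.
    domain: str
    allowed_tools: FrozenSet[str]
    approval_required: FrozenSet[str]


class PolicyRegistry:
    """Holds policy sets per domain and answers tool decisions for the active one."""

    def __init__(self) -> None:
        self._sets: Dict[str, PolicySet] = {}
        self._active: Optional[str] = None

    def register(self, policy_set: PolicySet) -> None:
        self._sets[policy_set.domain] = policy_set

    def activate(self, domain: str) -> None:
        # Switching domains changes behavior without touching the agent or model.
        if domain not in self._sets:
            raise KeyError(f"no policy set registered for domain '{domain}'")
        self._active = domain

    def decide(self, tool: str) -> str:
        # Fail closed: with no active policy set, nothing is allowed.
        if self._active is None:
            return "deny"
        active = self._sets[self._active]
        if tool not in active.allowed_tools:
            return "deny"
        if tool in active.approval_required:
            return "needs_approval"
        return "allow"
```

Failing closed when no policy set is active is a design choice of this sketch rather than something the paper states.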

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the checkpoints prove robust, organizations could standardize on a small set of governance primitives across multiple agent platforms.
The same interception pattern might extend to non-LLM autonomous systems that perform sequenced tool use.
Runtime policy injection could reduce the cost of adapting agents to changing regulations compared with periodic fine-tuning cycles.

Load-bearing premise

External policy interventions at the five checkpoints can reliably steer and constrain generalist LLM agents across arbitrary compound workflows without introducing new failure modes or requiring model changes.

What would settle it

A controlled test in which the agent completes a compound workflow that violates an active policy, such as executing a restricted tool or exposing protected information, while all five checkpoints remain active.
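
One way such a test could be scored is by checking each recorded run for the two violation types named above. The following is a sketch only: `RunRecord` and `find_violations` are hypothetical names, and the exact substring match on protected values is a simplifying assumption that would miss paraphrased or partially masked leaks.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class RunRecord:
    # Hypothetical record of one completed workflow with all checkpoints active.
    executed_tools: List[str]
    approved_tools: Set[str]
    final_output: str


def find_violations(
    run: RunRecord,
    restricted_tools: Set[str],
    protected_values: Set[str],
) -> List[str]:
    """Return a description of every policy violation found in a run."""
    violations: List[str] = []
    # A restricted tool that ran without a recorded human approval.
    for tool in run.executed_tools:
        if tool in restricted_tools and tool not in run.approved_tools:
            violations.append(f"restricted tool executed without approval: {tool}")
    # A protected value that appears verbatim in the final response.
    for value in sorted(protected_values):
        if value in run.final_output:
            violations.append(f"protected value exposed: {value}")
    return violations
```

A single run with a non-empty result while all five checkpoints were active would count as the counterexample described above.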

Original abstract

Enterprise agents are increasingly expected to operate autonomously across tools and interfaces, yet production deployments require governance by construction. Systems must specify which actions are allowed, when human oversight is required, and what information may be exposed, without rebuilding the agent for each domain. This demo presents CUGA's policy system, a modular policy-as-code layer that composes with a generalist LLM agent to deliver predictable, auditable, and compliance-aware behavior in compound workflows without model fine-tuning. We present a runtime governance architecture that enforces policy interventions at every critical stage of execution. Rather than passively constraining behavior, policies intercept the agent at five structural checkpoints: upstream of planning (Intent Guard), within the system prompt to steer reasoning (Playbook), at the tool-call boundary to enforce proper usage (Tool Guide), outside the reasoning loop as a Human-in-the-Loop gate for high-risk actions (Tool Approvals), and at the output stage to filter and structure the final response (Output Formatter). Together, these stages embed governance continuously across the agent's execution pipeline rather than treating it as an afterthought. Using a healthcare scenario and a multi-layered enforcement intervention, the demo shows dynamic playbook injection for structured tool-sequence enforcement, intent guards that block malicious or accidental harmful requests, and human-in-the-loop tool approval checkpoints for potentially destructive actions. The artifact illustrates how typed governance primitives enable faster, safer deployment of enterprise agentic systems while improving policy adherence and execution consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a clear engineering demo of a five-checkpoint policy layer for steering generalist agents without fine-tuning, but it rests on untested assumptions about LLM compliance and provides no data.

The paper describes CUGA's policy system as a modular policy-as-code layer that intercepts a generalist LLM agent at five fixed stages: Intent Guard before planning, Playbook inside the system prompt, Tool Guide at the tool-call boundary, Tool Approvals as a human gate, and Output Formatter at the end. The healthcare demo shows how these can enforce structured tool sequences and block bad intents in a compound workflow. That composition is the concrete contribution, and it is presented as something you can drop onto an existing agent rather than rebuilding the agent per domain. The description is straightforward, and the checkpoint breakdown makes the runtime flow easy to picture for anyone who has tried to add governance after the fact. It does a decent job of laying out typed primitives that could speed up deployment in regulated settings.

The soft spots are plain and not minor. The entire claim of predictable, auditable behavior without model changes depends on the LLM reliably following the injected prompts and guards across multi-step runs. The paper gives only a qualitative scenario and no adherence numbers, error cases, or comparison to a baseline without the layer. That leaves the central assumption unexamined, especially since generalist models are known to drift on long contexts. There are also no details on how the policies themselves are authored or maintained at scale.

This paper is for practitioners who need a working pattern for enterprise agent governance right now and are willing to add their own tests. Readers chasing new theory or reproducible results will not get much. It is worth sending to a serious referee so the authors can be asked for evaluation data and failure analysis, but it needs that work before it stands on its own.

Referee Report

1 major / 1 minor

Summary. The paper introduces CUGA's policy system, a modular policy-as-code layer designed to provide governance for generalist LLM agents. It enforces policy interventions at five structural checkpoints (Intent Guard, Playbook, Tool Guide, Tool Approvals, and Output Formatter) to achieve predictable, auditable, and compliance-aware behavior in compound workflows without requiring model fine-tuning. The approach is illustrated with a qualitative healthcare demo scenario demonstrating dynamic playbook injection, intent guards, and human-in-the-loop approvals.

Significance. The proposed architecture addresses an important practical challenge in deploying autonomous agents in enterprise environments by embedding governance directly into the agent's execution pipeline. The use of typed governance primitives and the modular composition with existing generalist agents is a notable strength. However, as a primarily descriptive demo without quantitative evaluation, its significance depends on future validation of the reliability of the prompt-based interventions.

major comments (1)

Abstract: The central claim that the system delivers 'predictable, auditable, and compliance-aware behavior' and improves 'policy adherence and execution consistency' is load-bearing but unsupported by any metrics, error analysis, or adherence rates from the healthcare scenario. The description of the five checkpoints relies on the assumption that the generalist LLM will consistently follow the injected constraints, yet no evidence or analysis is provided to address known issues like instruction drift in long workflows.
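
To make the request concrete, an adherence rate broken down by workflow length is one simple way to probe instruction drift. The sketch below assumes each run has already been labeled with its step count and whether it stayed within policy; the function name, the input shape, and the bucketing scheme are illustrative assumptions, and no such data appears in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def adherence_by_length(
    runs: Iterable[Tuple[int, bool]],
    bucket_size: int = 5,
) -> Dict[str, float]:
    """Fraction of policy-adherent runs per workflow-length bucket.

    Each run is (number_of_steps, adhered_to_policy). Buckets cover
    [0, bucket_size - 1], [bucket_size, 2 * bucket_size - 1], and so on.
    """
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    totals: Dict[int, int] = defaultdict(int)
    adherent: Dict[int, int] = defaultdict(int)
    for steps, adhered in runs:
        low = (steps // bucket_size) * bucket_size
        totals[low] += 1
        if adhered:
            adherent[low] += 1
    # Report buckets in ascending order of workflow length.
    return {
        f"{low}-{low + bucket_size - 1}": adherent[low] / totals[low]
        for low in sorted(totals)
    }
```

A rate that falls as the buckets grow longer would be direct evidence of drift; a flat rate would support the claim of execution consistency.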

minor comments (1)

The manuscript would benefit from clearer notation or diagrams illustrating the flow through the five checkpoints to aid reader understanding of the runtime architecture.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for recognizing the practical importance of embedding governance into generalist agent pipelines and for the constructive feedback on empirical support. We agree that the current work is a descriptive demonstration of the CUGA architecture rather than a quantitative study, and we will revise the manuscript to qualify claims and address limitations explicitly.

Referee: Abstract: The central claim that the system delivers 'predictable, auditable, and compliance-aware behavior' and improves 'policy adherence and execution consistency' is load-bearing but unsupported by any metrics, error analysis, or adherence rates from the healthcare scenario. The description of the five checkpoints relies on the assumption that the generalist LLM will consistently follow the injected constraints, yet no evidence or analysis is provided to address known issues like instruction drift in long workflows.

Authors: We acknowledge that the abstract presents the intended outcomes of the architecture in strong terms without accompanying quantitative metrics or error analysis from the healthcare demo. This manuscript is positioned as an architectural demonstration illustrating how typed policy primitives can be composed at five structural checkpoints; the scenario shows dynamic playbook injection, intent blocking, and human-in-the-loop approvals in a qualitative setting. The multi-stage design is intended to reduce dependence on any single prompt by adding external enforcement layers (Tool Approvals, Output Formatter) that operate outside the LLM reasoning loop. Nevertheless, we agree that explicit discussion of instruction drift and other LLM reliability issues is warranted. We will revise the abstract to frame the benefits as 'designed to enable' rather than 'delivers,' qualify the healthcare example as illustrative, and add a dedicated limitations subsection that discusses prompt-based intervention reliability, potential drift in extended workflows, and the need for future quantitative evaluation.

Revision: yes

Circularity Check

0 steps flagged

No circularity: system design description without derivations or self-referential claims

The manuscript presents CUGA as a modular policy-as-code architecture that intercepts generalist LLM agents at five named checkpoints (Intent Guard, Playbook, Tool Guide, Tool Approvals, Output Formatter). No equations, fitted parameters, predictions, or first-principles derivations appear anywhere in the provided text. All claims rest on descriptive system composition and a qualitative healthcare scenario rather than any reduction of outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is therefore self-contained as an engineering artifact; the absence of a derivation chain precludes any circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions about LLM agent behavior and introduces no free parameters, new entities, or non-standard axioms beyond the described policy components.

axioms (1)

Domain assumption: Generalist LLM agents can be effectively steered and constrained through external runtime interventions at planning, prompting, tool, approval, and output stages.
This assumption underpins the claim that the five checkpoints deliver governance without model fine-tuning.

pith-pipeline@v0.9.0 · 5818 in / 1182 out tokens · 35908 ms · 2026-05-21T04:48:54.434453+00:00 · methodology

