Pith · machine review for the scientific record

arXiv: 2605.14290 · v1 · submitted 2026-05-14 · 💻 cs.CR · cs.AI · cs.CL · cs.SE

Recognition: 2 theorem links


Web Agents Should Adopt the Plan-Then-Execute Paradigm

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:39 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL · cs.SE
keywords: web agents · plan-then-execute · ReAct · prompt injection · WebArena · semantic actions · browser tools

The pith

Web agents should commit to a task-specific program before observing runtime web content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ReAct has become the default for LLM agents but exposes web agents to prompt injections since page content from sellers, reviews, and ads can steer decisions. The paper argues for plan-then-execute as the default, where agents commit to a complete task program before seeing any web content and then follow it strictly. This way untrusted data can only affect internal values or branches but cannot change the overall task or create new actions. Analysis of WebArena shows all tasks fit this model and 80 percent can use a purely programmatic plan without runtime model calls. The barrier is the web's low-level browser tools, which require new typed semantic interfaces to enable effective upfront planning.

Core claim

The paper's central claim is that web agents should default to plan-then-execute: commit to a task-specific program before observing runtime web content, then execute it. The reason is that web content mixes inputs from many parties, creating a direct path for prompt injections under ReAct. Plan-then-execute changes the boundary so untrusted data cannot redefine the user task. All WebArena tasks are compatible with this approach, while 80% can be completed with a purely programmatic plan without any runtime LLM subroutine.
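The boundary the claim describes can be sketched in a few lines of Python. This is an illustrative reading of the paradigm, not the authors' implementation; `Step`, `make_plan`, and the example task are invented names. The point is that the plan is committed from the trusted user task alone, and untrusted page content enters execution only as data values.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Step:
    """One committed action; both fields are fixed at plan time."""
    action: str   # e.g. "search", "extract_price", "select_min"
    target: str

def make_plan(task: str) -> tuple[Step, ...]:
    # Planning sees only the trusted user task, never runtime web content.
    if task == "buy_cheapest_dock":
        return (
            Step("search", "docking station"),
            Step("extract_price", "results"),
            Step("select_min", "prices"),
        )
    raise ValueError(f"no plan for {task!r}")

def execute(plan: tuple[Step, ...], fetch: Callable[[Step], str]) -> list[str]:
    # Untrusted page content enters only as the value returned by fetch();
    # it cannot add, remove, or reorder the committed steps.
    return [f"{step.action}({step.target}) -> {fetch(step)}" for step in plan]

# Even a page carrying an injection cannot alter the control flow:
plan = make_plan("buy_cheapest_dock")
trace = execute(plan, lambda step: "IGNORE PREVIOUS INSTRUCTIONS")
```

Under ReAct, by contrast, the string returned by `fetch` would be fed back into the model that chooses the next action, which is exactly the injection path the paper describes.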

What carries the argument

The plan-then-execute paradigm that commits to a fixed execution graph before any runtime observations of web content.

If this is right

  • Web agents resist control-flow hijacking from prompt injections in mixed-content pages.
  • Most web tasks in benchmarks like WebArena can be handled without runtime LLM decisions.
  • Development effort should target typed website APIs that expose semantic actions with known effects.
  • Agents become more auditable since their behavior follows a predefined program.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Creating standardized task-level APIs for popular sites would make plan-then-execute practical across the web.
  • The same separation of planning and execution could improve security in other agent settings with untrusted data sources.
  • Empirical tests measuring injection success rates under both paradigms would quantify the security benefit.

Load-bearing premise

That web tasks do not require reactivity by default and that tools can be made to map cleanly to semantic actions with effects known before execution.
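What a "semantic action with effects known before execution" might look like, as a hedged sketch: the paper specifies no concrete API, so the type names below are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SearchProducts:
    """Typed semantic action: its effect is knowable before execution."""
    query: str
    max_price: Optional[float] = None

@dataclass(frozen=True)
class AddToCart:
    product_id: str
    quantity: int = 1

# Contrast with low-level browser tools, whose meaning is page-dependent:
#   click(x=412, y=187)        # "Add to cart"? a sponsored ad? unknowable upfront
#   type_text("docking dock")  # target field resolved only at render time

def describe(action: object) -> str:
    """Effects can be stated at plan time, which makes a plan auditable."""
    if isinstance(action, SearchProducts):
        return f"search catalog for {action.query!r}"
    if isinstance(action, AddToCart):
        return f"add {action.quantity}x {action.product_id} to cart"
    raise TypeError(f"unknown action: {type(action).__name__}")
```

An upfront planner can enumerate and audit `describe(a)` for every action in a plan before anything touches the live page, which is precisely what click/type/scroll primitives do not allow.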

What would settle it

A web task that requires synthesizing new actions based on runtime observations or a successful prompt injection that redefines the task despite an upfront committed plan.

Figures

Figures reproduced from arXiv: 2605.14290 by Annabella Chow, David Wagner, Jinhao Zhu, Julien Piet, Muxi Lyu, Raluca Ada Popa, Sylvie Venuto, Yiwei Hou.

Figure 1. ReAct exposes action selection to untrusted content, creating a direct path for prompt injections. [image: figures/full_fig_p002_1.png]
Original abstract

ReAct has become the default architecture across LLM agents, and many existing web agents follow this paradigm. We argue that it is the wrong default for web agents. Instead, web agents should default to plan-then-execute: commit to a task-specific program before observing runtime web content, then execute it. The reason is that web content mixes inputs from many parties. An e-commerce product page may combine a seller's listing, customer reviews and sponsored advertisements. Under ReAct, all of this content flows into the model when deciding on the next action, creating a direct path for prompt injections to steer the agent's control flow. Plan-then-execute changes this boundary: untrusted data may influence values or branches inside a predefined execution graph, but it cannot redefine the user task or cause the model to synthesize new actions at runtime. We analyze WebArena, a popular web agent benchmark, and find that all tasks are compatible with plan-then-execute, while 80% can be completed with a purely programmatic plan, without any runtime LLM subroutine. We identify the main barrier to adopting plan-then-execute on the web: For it to work well, tools must map cleanly to semantic actions, with effects known before execution, so agents have enough information to plan. The web does not naturally expose that interface. Browser tools such as click, type, and scroll have page-dependent meanings. Planning at this layer is near-sighted: the agent can only see actions on the current page, and later actions appear only after it acts. Closing this gap requires typed interfaces that turn website interactions from clicks and keystrokes to task-level operations. This is an infrastructure problem, not a modeling problem. Web tasks do not need reactivity by default; they need typed, complete, auditable website APIs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper argues that ReAct is the wrong default architecture for web agents because untrusted web content from multiple parties creates direct paths for prompt injections to alter control flow. It proposes plan-then-execute as the default: commit to a task-specific program before observing runtime content, so that untrusted data can only affect values or branches inside a predefined graph but cannot redefine the task or synthesize new actions. Analysis of WebArena shows all tasks are compatible with this paradigm and 80% can be completed with purely programmatic plans without runtime LLM subroutines. The main barrier is that browser tools (click, type, scroll) have page-dependent semantics, making planning myopic; the solution is typed, task-level website APIs rather than low-level actions.

Significance. If the empirical classification holds, the paper offers a clear security distinction between architectures and correctly reframes web-agent robustness as an infrastructure problem. It provides a falsifiable claim about WebArena task compatibility and identifies a concrete interface gap that, if closed, would enable auditable, less reactive agents. This is a substantive contribution to the security of LLM agents on the open web.

major comments (1)
  1. [WebArena analysis] WebArena analysis section: the claim that 80% of tasks admit purely programmatic plans (and all are compatible) lacks any stated criteria for classifying a plan as 'purely programmatic,' for determining when runtime page content may still alter control flow or data values, or for how the 80% figure was obtained. Without reproducible inspection rules or task-by-task breakdown, the load-bearing empirical support for the security advantage over ReAct cannot be verified.
minor comments (2)
  1. [Abstract and §1] The abstract and introduction use 'program' and 'execution graph' without an early formal definition or example of what constitutes a static versus reactive plan.
  2. [Results] Figure or table presenting the WebArena breakdown (if present) should include explicit columns for classification criteria and inter-rater agreement.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the paper's core security distinction between ReAct and plan-then-execute. We address the single major comment below by providing the missing classification criteria and committing to a reproducible breakdown in the revision.

Point-by-point responses
  1. Referee: [WebArena analysis] WebArena analysis section: the claim that 80% of tasks admit purely programmatic plans (and all are compatible) lacks any stated criteria for classifying a plan as 'purely programmatic,' for determining when runtime page content may still alter control flow or data values, or for how the 80% figure was obtained. Without reproducible inspection rules or task-by-task breakdown, the load-bearing empirical support for the security advantage over ReAct cannot be verified.

    Authors: We agree that explicit criteria and a task-level breakdown are necessary for verifiability. In the revised manuscript we will add a new subsection (and appendix) that defines: (1) a plan is 'purely programmatic' if it consists of a fixed sequence of typed actions whose control flow contains no runtime LLM calls for branching or action synthesis—data values may be filled from page content but the task graph itself is committed before any observation; (2) a task is 'compatible' with plan-then-execute if its success predicate can be expressed as a static program whose only runtime inputs are value bindings inside that program (no redefinition of the user intent or invention of new actions); (3) the 80% figure was obtained by exhaustive manual review of all 812 WebArena tasks, classifying each according to whether its gold trajectory could be realized by such a static program. We will include the full per-task classification table (or a representative sample with clear rules) so that the empirical claim can be independently reproduced. revision: yes
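The rebuttal's criterion (1) could be operationalized along these lines. The step encoding is hypothetical (neither the paper nor the simulated rebuttal defines a concrete schema); it only shows the distinction being drawn: value fills and branches over values are allowed, runtime LLM action selection is not.

```python
# Hypothetical step kinds: page content may fill values ("value_bind") and
# select branches over those values ("branch_on_value"), but any step that
# asks an LLM to choose or synthesize the next action disqualifies the plan.
ALLOWED = {"typed_action", "value_bind", "branch_on_value"}

def purely_programmatic(plan: list[dict]) -> bool:
    """Criterion (1): no runtime LLM calls for branching or action synthesis."""
    return all(step.get("kind") in ALLOWED for step in plan)

static_plan = [
    {"kind": "typed_action", "op": "search", "arg": "docking station"},
    {"kind": "value_bind", "op": "extract_prices"},    # values from the page
    {"kind": "branch_on_value", "op": "select_min"},   # branch, not a new action
]
reactive_plan = static_plan + [{"kind": "llm_decide_next_action"}]
```

Applied per task, a rule of this shape would make the 80% figure reproducible from the published per-task classification table.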

Circularity Check

0 steps flagged

No circularity: architectural argument backed by independent benchmark inspection

Full rationale

The paper advances a design recommendation (plan-then-execute as default) on security grounds and supports it with a direct inspection of WebArena tasks. No equations, fitted parameters, or self-citations appear in the derivation. The compatibility claim is stated as an empirical observation from task review rather than a quantity derived from prior model outputs or definitions inside the paper. The argument therefore remains self-contained and does not reduce any central result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about web content structure and benchmark compatibility; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Web content mixes inputs from many parties, creating a direct path for prompt injections under ReAct.
    Core premise stated in the abstract as the reason ReAct is unsuitable.
  • domain assumption All WebArena tasks are compatible with plan-then-execute and 80% can be completed with purely programmatic plans.
    Result of the paper's benchmark analysis invoked to support the paradigm recommendation.

pith-pipeline@v0.9.0 · 5651 in / 1321 out tokens · 44813 ms · 2026-05-15T02:39:54.993875+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 10 internal anchors

  1. [1]

    IPIGuard: A novel tool dependency graph-based defense against indirect prompt injection in LLM agents. arXiv preprint arXiv:2508.15310, 2025

    Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, and Shouling Ji. IPIGuard: A novel tool dependency graph-based defense against indirect prompt injection in LLM agents. arXiv preprint arXiv:2508.15310, 2025

  2. [2]

    Claude code auto mode: a safer way to skip permissions

    Anthropic. Claude code auto mode: a safer way to skip permissions. https://www.anthropic.com/engineering/claude-code-auto-mode, 2026. Accessed: 2026-05-06

  3. [3]

    Design patterns for securing LLM agents against prompt injections. arXiv preprint arXiv:2506.08837, 2025

    Luca Beurer-Kellner, Beat Buesser, Ana-Maria Crețu, Edoardo Debenedetti, Daniel Dobos, Daniel Fabian, Marc Fischer, David Froelicher, Kathrin Grosse, Daniel Naeff, et al. Design patterns for securing LLM agents against prompt injections. arXiv preprint arXiv:2506.08837, 2025

  4. [4]

    StruQ: Defending Against Prompt Injection with Structured Queries, September 2024

    Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. StruQ: Defending against prompt injection with structured queries. arXiv preprint arXiv:2402.06363, 2024

  5. [5]

    Secalign: Defending against prompt injection with preference optimization

    Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. Secalign: Defending against prompt injection with preference optimization. arXiv preprint arXiv:2410.05451, 2024

  6. [6]

    Securing AI Agents with Information-Flow Control

    Manuel Costa, Boris Köpf, Aashish Kolluri, Andrew Paverd, Mark Russinovich, Ahmed Salem, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. Securing AI agents with information-flow control. arXiv preprint arXiv:2505.23643, 2025

  7. [7]

    Defeating Prompt Injections by Design

    Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design. arXiv preprint arXiv:2503.18813, 2025

  8. [8]

    AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. arXiv preprint arXiv:2406.13352, 2024

  9. [9]

    Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  10. [10]

    WASP: Benchmarking web agent security against prompt injection attacks

    Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, and Kamalika Chaudhuri. WASP: Benchmarking web agent security against prompt injection attacks. arXiv preprint arXiv:2504.18575, 2025

  11. [11]

    Camels can use computers too: System-level security for computer use agents. arXiv preprint arXiv:2601.09923, 2026

    Hanna Foerster, Tom Blanchard, Kristina Nikolić, Ilia Shumailov, Cheng Zhang, Robert Mullins, Nicolas Papernot, Florian Tramèr, and Yiren Zhao. Camels can use computers too: System-level security for computer use agents. arXiv preprint arXiv:2601.09923, 2026

  12. [12]

    Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90, 2023

  13. [13]

    OpAgent: Operator Agent for Web Navigation

    Yuyu Guo, Wenjie Yang, Siyuan Yang, Ziyang Liu, Cheng Chen, Yuan Wei, Yun Hu, Yang Huang, Guoliang Hao, Dongsheng Yuan, et al. Opagent: Operator agent for web navigation. arXiv preprint arXiv:2602.13559, 2026

  14. [14]

    The task shield: Enforcing task alignment to defend against indirect prompt injection in LLM agents. arXiv preprint arXiv:2412.16682, 2025

    Feiran Jia, Tong Wu, Xin Qin, and Anna Squicciarini. The task shield: Enforcing task alignment to defend against indirect prompt injection in LLM agents. arXiv preprint arXiv:2412.16682, 2025

  15. [15]

    Optimizing agent planning for security and autonomy. https://openreview.net/forum?id=g0aVCDY3gS, 2026

    Aashish Kolluri, Rishi Sharma, Manuel Costa, Boris Köpf, Tobias Nießen, Mark Russinovich, Shruti Tople, and Santiago Zanella-Béguelin. Optimizing agent planning for security and autonomy. https://openreview.net/forum?id=g0aVCDY3gS, 2026. ICLR 2026 poster

  16. [16]

    WAInjectBench: Benchmarking prompt injection detections for web agents. arXiv preprint arXiv:2510.01354, 2025

    Yinuo Liu, Ruohan Xu, Xilong Wang, Yuqi Jia, and Neil Zhenqiang Gong. WAInjectBench: Benchmarking prompt injection detections for web agents. arXiv preprint arXiv:2510.01354, 2025

  17. [17]

    Formalizing and benchmarking prompt injection attacks and defenses

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1831–1847, 2024

  18. [18]

    cellmate: Sandboxing browser AI agents. arXiv preprint arXiv:2512.12594, 2025

    Luoxi Meng, Henry Feng, Ilia Shumailov, and Earlence Fernandes. cellmate: Sandboxing browser AI agents. arXiv preprint arXiv:2512.12594, 2025

  19. [19]

    Browser use: Enable ai to control your browser

    Magnus Müller and Gregor Žunič. Browser use: Enable AI to control your browser. https://github.com/browser-use/browser-use, 2024

  20. [20]

    The attacker moves second: Stronger adaptive attacks bypass defenses against LLM jailbreaks and prompt injections

    Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, and Florian Tramèr. The attacker moves second: Stronger adaptive attacks bypass defenses against LLM jailbreaks and prompt injections. arXiv pre...

  21. [21]

    Computer-using agent: Introducing a universal interface for AI to interact with the digital world

    OpenAI. Computer-using agent: Introducing a universal interface for AI to interact with the digital world. 2025

  22. [22]

    OWASP Top 10 for large language model applications 2025

    OWASP GenAI Security Project. OWASP Top 10 for large language model applications 2025. https://genai.owasp.org/llm-top-10/, 2025. Accessed 2026-03-19

  23. [23]

    Jatmo: Prompt injection defense by task-specific finetuning

    Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, and David Wagner. Jatmo: Prompt injection defense by task-specific finetuning. In European Symposium on Research in Computer Security, 2024

  24. [24]

    All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask)

    Edward J Schwartz, Thanassis Avgerinos, and David Brumley. All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask). In 2010 IEEE Symposium on Security and Privacy, pages 317–331. IEEE, 2010

  25. [25]

    Beyond browsing: API-based web agents. arXiv preprint arXiv:2410.16464, 2024

    Yueqi Song, Frank Xu, Shuyan Zhou, and Graham Neubig. Beyond browsing: API-based web agents. arXiv preprint arXiv:2410.16464, 2024

  26. [26]

    Contextual agent security: A policy for every purpose

    Lillian Tsai and Eugene Bagdasarian. Contextual agent security: A policy for every purpose. arXiv preprint arXiv:2501.17070, 2025. Also appeared at HotOS 2025

  27. [27]

    The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. arXiv preprint arXiv:2404.13208, 2024

  28. [28]

    Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection. arXiv preprint arXiv:2508.01249, 2025

    Peiran Wang, Yang Liu, Yunfei Lu, Yifeng Cai, Hongbo Chen, Qingyou Yang, Jie Zhang, Jue Hong, and Ye Wu. Agentarmor: Enforcing program analysis on agent runtime trace to defend against prompt injection. arXiv preprint arXiv:2508.01249, 2025

  29. [29]

    Webinject: Prompt injection attack to web agents

    Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, and Neil Zhenqiang Gong. Webinject: Prompt injection attack to web agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2010–2030, 2025

  30. [30]

    RL is a hammer and LLMs are nails: A simple reinforcement learning recipe for strong prompt injection. arXiv preprint arXiv:2510.04885, 2025

    Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, and Chuan Guo. RL is a hammer and LLMs are nails: A simple reinforcement learning recipe for strong prompt injection. arXiv preprint arXiv:2510.04885, 2025

  31. [31]

    Openagents: An open platform for language agents in the wild

    Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, et al. Openagents: An open platform for language agents in the wild. In ICLR Workshop on Large Language Model (LLM) Agents, 2024

  32. [32]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

  33. [33]

    Agentfold: Long-horizon web agents with proactive context folding

    Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, et al. Agentfold: Long-horizon web agents with proactive context folding. In The Fourteenth International Conference on Learning Representations, 2026

  34. [34]

    Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv preprint arXiv:2312.14197, 2025

    Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv preprint arXiv:2312.14197, 2025

  35. [35]

    InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

    Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. arXiv preprint arXiv:2403.02691, 2024

  36. [36]

    API agents vs. GUI agents: Divergence and convergence

    Chaoyun Zhang, Shilin He, Liqun Li, Si Qin, Yu Kang, Qingwei Lin, and Dongmei Zhang. API agents vs. GUI agents: Divergence and convergence. arXiv preprint arXiv:2503.11069, 2025

  37. [37]

    Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents. arXiv preprint arXiv:2410.02644, 2024

  38. [38]

    Browsesafe: Understanding and preventing prompt injection within AI browser agents

    Kaiyuan Zhang, Mark Tenenholtz, Kyle Polley, Jerry Ma, Denis Yarats, and Ninghui Li. Browsesafe: Understanding and preventing prompt injection within AI browser agents. arXiv preprint arXiv:2511.20597, 2025

  39. [39]

    GPT-4V(ision) is a Generalist Web Agent, if Grounded

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. GPT-4V(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024

  40. [40]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023

  41. [41]

    Miniscope: A least privilege framework for authorizing tool calling agents

    Jinhao Zhu, Kevin Tseng, Gil Vernik, Xiao Huang, Shishir G. Patil, Vivian Fang, and Raluca Ada Popa. Miniscope: A least privilege framework for authorizing tool calling agents. arXiv preprint arXiv:2512.11147, 2025