Recognition: 1 theorem link
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
Pith reviewed 2026-05-15 02:44 UTC · model grok-4.3
The pith
Web agents based on large language models show some success on enterprise tasks but leave a large gap to full automation
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs.
What carries the argument
The WorkArena benchmark of 33 ServiceNow tasks, together with the BrowserGym environment, which supplies a rich action space and multimodal observations for agent design and evaluation
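The machinery above implies a gym-style interaction loop: the environment emits a goal plus observations of the page, and the agent emits a browser action. A minimal runnable sketch of that loop; every name here (WebEnv, the observation fields, the fill(...) action string) is an illustrative stand-in, not BrowserGym's actual API:

```python
# Toy sketch of a gym-style browser-environment loop, assuming a
# reset/step interface like the one BrowserGym is described as providing.
# All names and the action format are hypothetical stand-ins.
from dataclasses import dataclass, field


@dataclass
class WebEnv:
    """Stand-in for a browser environment: one form field to fill."""
    goal: str = "fill the 'priority' field with 'High'"
    form: dict = field(default_factory=dict)

    def reset(self):
        self.form = {"priority": ""}
        # A real environment would also return a screenshot / DOM / AXTree.
        return {"goal": self.goal, "dom": dict(self.form)}

    def step(self, action: str):
        # Actions are plain strings, e.g. "fill('priority', 'High')".
        if action.startswith("fill("):
            key, value = action[5:-1].replace("'", "").split(", ")
            self.form[key] = value
        obs = {"goal": self.goal, "dom": dict(self.form)}
        reward = 1.0 if self.form.get("priority") == "High" else 0.0
        return obs, reward, reward == 1.0


def scripted_agent(obs):
    # Stands in for an LLM that maps an observation to the next action.
    return "fill('priority', 'High')"


env = WebEnv()
obs = env.reset()
done, total = False, 0.0
while not done:
    obs, reward, done = env.step(scripted_agent(obs))
    total += reward
print(total)  # 1.0
```

The point of the sketch is the evaluation contract: success is a scalar reward computed by the environment from the final page state, which is what makes remote-hosted benchmarking of heterogeneous agents possible.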
If this is right
- Agents can manage some simpler browser-based tasks but fail on more involved ones.
- Closing the performance gap between open and closed-source models is a priority for progress.
- The benchmark can be used to track improvements in agent capabilities over time.
- Full automation of knowledge-work tasks is not yet feasible with existing methods.
Where Pith is reading between the lines
- Extending the benchmark to other enterprise platforms would test whether the observed gaps are specific to ServiceNow or more general.
- The results point to a need for better ways to handle long sequences of actions and complex interfaces in agent training.
- Access to proprietary model capabilities currently provides a practical advantage for deploying web agents in work settings.
Load-bearing premise
The 33 tasks chosen for WorkArena are representative of the typical daily work of knowledge workers utilizing enterprise software systems.
What would settle it
Demonstrating near-complete success rates on the 33 WorkArena tasks with new open-source models or on a wider set of real enterprise workflows would challenge the claim of a considerable remaining gap.
read the original abstract
We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WorkArena, a remote-hosted benchmark of 33 tasks drawn from the ServiceNow enterprise platform, together with the BrowserGym environment that supplies a rich action space and multimodal observations. Through empirical evaluation of LLM-based web agents, it claims that current agents show promise on these tasks yet exhibit a considerable gap to full automation, with a significant performance disparity between open- and closed-source models.
Significance. If the 33 tasks prove representative of typical knowledge-worker workflows, the benchmark and environment would supply a concrete, reproducible testbed for measuring progress toward practical web agents in enterprise settings. The reported open/closed-source gap would also constitute a falsifiable observation that could guide model development priorities.
major comments (1)
- [Task construction / benchmark definition] The central claim of a 'considerable gap towards achieving full task automation' and the open/closed-source disparity both rest on the assumption that the 33 ServiceNow tasks are representative of daily knowledge work. The manuscript describes the tasks as 'based on the widely-used ServiceNow platform' but supplies no usage-log frequency analysis, expert coverage survey, or cross-platform sampling to support this representativeness (see abstract and the task-construction description).
minor comments (2)
- [Evaluation] The abstract states that the evaluation 'reveals' the performance gap but does not specify success metrics, statistical tests, or controls; these details should be stated explicitly in the evaluation section.
- [BrowserGym environment] BrowserGym's action set and observation modalities are described at a high level; a table enumerating the exact actions and observation channels would improve reproducibility.
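The metric detail the minor comment asks for amounts to per-task success rates with uncertainty. A minimal sketch of one standard way such numbers could be reported; the Wilson interval and the illustrative counts below are our choices, not taken from the paper:

```python
# Hedged sketch: per-task success rate with a 95% Wilson score interval.
# The paper's actual metrics and statistics are not specified in the
# abstract; task names and counts here are illustrative only.
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial success rate."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z**2 / (4 * trials**2)
    )
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical results: (successes, seeds) per task.
results = {"create_incident": (7, 10), "search_knowledge_base": (3, 10)}
for task, (ok, n) in results.items():
    lo, hi = wilson_interval(ok, n)
    print(f"{task}: {ok}/{n} = {ok/n:.0%} (95% CI {lo:.0%} to {hi:.0%})")
```

With only 10 seeds per task the intervals are wide (7/10 spans roughly 40% to 89%), which is exactly why explicit uncertainty reporting matters for claims about a "considerable gap".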
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the concern regarding task representativeness below. While we cannot retroactively add proprietary usage logs, we can strengthen the justification for task selection and clarify the scope of our claims.
read point-by-point responses
-
Referee: The central claim of a 'considerable gap towards achieving full task automation' and the open/closed-source disparity both rest on the assumption that the 33 ServiceNow tasks are representative of daily knowledge work. The manuscript describes the tasks as 'based on the widely-used ServiceNow platform' but supplies no usage-log frequency analysis, expert coverage survey, or cross-platform sampling to support this representativeness (see abstract and the task-construction description).
Authors: We agree that a formal frequency analysis or expert survey would strengthen the claim of representativeness and note its absence as a limitation. The 33 tasks were curated by the authors (who have direct experience with ServiceNow deployments) to cover core, recurring operations in IT service management, HR, and knowledge workflows that are documented as standard in ServiceNow's own user guides and industry reports (e.g., incident creation, knowledge article search, user provisioning). ServiceNow is deployed in over 10,000 organizations and these task types align with publicly available case studies of daily knowledge-worker activity. The performance gap and open/closed-source disparity are reported as empirical observations on this benchmark rather than universal claims about all knowledge work; we will revise the abstract and task-construction section to explicitly frame the benchmark as a representative but non-exhaustive sample of enterprise web tasks and add references to ServiceNow documentation. No cross-platform sampling was performed because the benchmark deliberately targets a single widely-used platform to enable reproducible, remote-hosted evaluation. revision: partial
Circularity Check
No significant circularity in empirical benchmark evaluation
full rationale
This paper introduces the WorkArena benchmark and BrowserGym environment, then reports direct empirical performance measurements of LLM-based agents on 33 ServiceNow tasks. No mathematical derivations, equations, parameter fitting, or predictive claims exist that could reduce to inputs by construction. The central findings (performance gap to full automation and open/closed-source LLM disparity) rest on observed success rates rather than any self-definitional loop, fitted-input renaming, or load-bearing self-citation chain. Task selection and representativeness are stated assumptions open to external validation, but they do not create circularity within the reported results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 33 tasks in WorkArena represent typical daily knowledge work on enterprise platforms.
Forward citations
Cited by 22 Pith papers
-
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows
EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-speciali...
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
-
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
-
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
-
SAGE: A Service Agent Graph-guided Evaluation Benchmark
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...
-
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
-
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
-
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
-
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
-
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
-
LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training
LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
-
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
-
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
-
LLMs Corrupt Your Documents When You Delegate
LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
-
Agent Workflow Memory
AWM induces reusable workflows from agent experiences and provides them selectively to improve success rates by 24.6% on Mind2Web and 51.1% on WebArena while reducing steps taken.
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
-
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
-
ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation
Expanded orchestration in ChromaFlow lowered accuracy on GAIA tasks from 29/53 to 27/53 while increasing timeouts, tool failures, and costs.
Reference graph
Works this paper leans on
-
[1]
The unsolved challenges of LLMs in open-ended web tasks: A case study
Assouel, R., Marty, T., Caccia, M., Laradji, I., Drouin, A., Rajeswar, S., Palacios, H., Cappart, Q., Vazquez, D., Chapados, N., Gasse, M., and Lacoste, A. The unsolved challenges of LLMs in open-ended web tasks: A case study. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023. URL https://openreview.net/forum?id=jt3il4fC5B
work page 2023
-
[2]
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016
work page 2016
-
[3]
Mind2Web: Towards a generalist agent for the web
Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y. Mind2Web: Towards a generalist agent for the web. arXiv, abs/2306.06070, 2023
-
[4]
Furuta, H., Nachum, O., Lee, K.-H., Matsuo, Y., Gu, S. S., and Gur, I. Multimodal web navigation with instruction-finetuned foundation models. arXiv, abs/2305.11854, 2023. URL https://arxiv.org/abs/2305.11854
-
[5]
Chrome devtools protocol, 2023
Google . Chrome devtools protocol, 2023. URL https://chromedevtools.github.io/devtools-protocol/
work page 2023
-
[7]
Gur, I., Furuta, H., Huang, A., Safdari, M., Matsuo, Y., Eck, D., and Faust, A. A real-world WebAgent with planning, long context understanding, and program synthesis. arXiv, abs/2307.12856, 2023b. URL https://arxiv.org/abs/2307.12856
-
[8]
WebVoyager: Building an end-to-end web agent with large multimodal models
He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., Lan, Z., and Yu, D. WebVoyager: Building an end-to-end web agent with large multimodal models. arXiv, abs/2401.13919, 2024. URL https://arxiv.org/abs/2401.13919
-
[9]
Automatic macro mining from interaction traces at scale
Huang, F., Li, G., Li, T., and Li, Y. Automatic macro mining from interaction traces at scale. In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI '24, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400703300. doi:10.1145/3613904.3642074. URL https://doi.org/10.1145/3613904.3642074
-
[10]
Humphreys, P. C., Raposo, D., Pohlen, T., Thornton, G., Chhaparia, R., Muldal, A., Abramson, J., Georgiev, P., Santoro, A., and Lillicrap, T. A data-driven approach for learning to control computers. In International Conference on Machine Learning (ICML), 2022
work page 2022
-
[11]
Language models can solve computer tasks
Kim, G., Baldi, P., and McAleer, S. Language models can solve computer tasks. arXiv, abs/2303.17491, 2023. URL https://arxiv.org/abs/2303.17491
-
[12]
Mapping natural language instructions to mobile UI action sequences
Li, Y., He, J., Zhou, X., Zhang, Y., and Baldridge, J. Mapping natural language instructions to mobile UI action sequences. In Annual Conference of the Association for Computational Linguistics (ACL 2020), 2020. URL https://www.aclweb.org/anthology/2020.acl-main.729.pdf
work page 2020
-
[13]
Reinforcement learning on web interfaces using workflow-guided exploration
Liu, E. Z., Guu, K., Pasupat, P., Shi, T., and Liang, P. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations (ICLR), 2018
work page 2018
-
[14]
AgentBench: Evaluating LLMs as Agents
Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., Huang, M., Dong, Y., and Tang, J. AgentBench: Evaluating LLMs as agents. arXiv, abs/2308.03688, 2023a. URL https://arxiv.org/abs/2308.03688
work page 2023
-
[15]
BOLAA: Benchmarking and orchestrating LLM-augmented autonomous agents
Liu, Z., Yao, W., Zhang, J., Xue, L., Heinecke, S., Murthy, R., Feng, Y., Chen, Z., Niebles, J. C., Arpit, D., Xu, R., Mui, P., Wang, H., Xiong, C., and Savarese, S. BOLAA: Benchmarking and orchestrating LLM-augmented autonomous agents. arXiv, abs/2308.05960, 2023b
-
[16]
Lù, X. H., Kasner, Z., and Reddy, S. WebLINX: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024
-
[17]
Knowledge 2020: "The digital workflow revolution has just begun"
Maas, M. Knowledge 2020: "The digital workflow revolution has just begun". Technical report, Sprinklr, 2020. URL https://www.linkedin.com/pulse/knowledge-2020-digital-workflow-revolution-has-just-begun-maas/
work page 2020
-
[18]
ServiceNow joins the prestigious Fortune 500 list
Mastantuono, G. ServiceNow joins the prestigious Fortune 500 list. https://www.servicenow.com/blogs/2023/servicenow-joins-fortune-500-list.html, 2023. Accessed: 2024-01-29
work page 2023
-
[19]
Llama 3: Meta's latest large language model
Meta. Llama 3: Meta's latest large language model. https://github.com/meta-llama/llama3, 2024. Accessed: 2024-06-03
work page 2024
-
[20]
Playwright for Python documentation, 2023
Microsoft. Playwright for Python documentation, 2023. URL https://playwright.dev/python/
work page 2023
-
[21]
WebGPT: Browser-assisted question-answering with human feedback
Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button, K., Knight, M., Chess, B., and Schulman, J. WebGPT: Browser-assisted question-answering with human feedback. arXiv, abs/2112.09332, 2021. URL https://arxiv.org/abs/2112.09332
work page 2021
-
[22]
OpenAI. GPT-4 technical report. ArXiv, abs/2303.08774, 2023. URL https://arxiv.org/abs/2303.08774
work page 2023
-
[23]
AndroidInTheWild: A large-scale dataset for Android device control
Rawles, C., Li, A., Rodriguez, D., Riva, O., and Lillicrap, T. AndroidInTheWild: A large-scale dataset for Android device control. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 59708–59728, 2023
work page 2023
-
[24]
Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles
SAE. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles. Technical report, Society of Automotive Engineers (SAE), April 2021. URL https://doi.org/10.4271/J3016_202104
-
[25]
ServiceNow. Vancouver release notes. Online, 2023. Available at: https://docs.servicenow.com/bundle/vancouver-release-notes/
work page 2023
-
[26]
World of bits: An open-domain platform for web-based agents
Shi, T., Karpathy, A., Fan, L., Hernandez, J., and Liang, P. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning (ICML), 2017a
work page 2017
-
[27]
World of bits: An open-domain platform for web-based agents
Shi, T., Karpathy, A., Fan, L., Hernandez, J., and Liang, P. World of bits: An open-domain platform for web-based agents. ICML, 2017b
work page 2017
-
[28]
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton-Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Kh...
work page 2023
-
[29]
A journey into the future of the translation industry, 2021
van der Meer, J. A journey into the future of the translation industry, 2021. URL https://www.taus.net/resources/blog/a-journey-into-the-future-of-the-translation-industry. Accessed: 2024-02-01
work page 2021
-
[30]
Emergent Abilities of Large Language Models
Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a
work page 2022
-
[31]
Wei, J., Wang, X., Schuurmans, D., Bosma, M., ichter, b., Xia, F., Chi, E., Le, Q. V., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 24824–24837. Curran Associates, Inc...
work page 2022
-
[32]
Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., and Yu, T. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024
work page 2024
-
[33]
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Yang, J., Zhang, H., Li, F., Zou, X., Li, C., and Gao, J. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441, 2023
work page 2023
-
[34]
WebShop: Towards scalable real-world web interaction with grounded language agents
Yao, S., Chen, H., Yang, J., and Narasimhan, K. WebShop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[35]
ReAct: Synergizing Reasoning and Acting in Language Models
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. ReAct: Synergizing reasoning and acting in language models. arXiv, abs/2210.03629, 2023. URL https://arxiv.org/abs/2210.03629
work page 2023
-
[36]
AgentTuning: Enabling generalized agent abilities for LLMs
Zeng, A., Liu, M., Lu, R., Wang, B., Liu, X., Dong, Y., and Tang, J. AgentTuning: Enabling generalized agent abilities for LLMs. arXiv preprint arXiv:2310.12823, 2023
-
[37]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Bisk, Y., Fried, D., Alon, U., and Neubig, G. WebArena: A realistic web environment for building autonomous agents. arXiv, abs/2307.13854, 2023. URL https://arxiv.org/abs/2307.13854
work page 2023
discussion (0)