Recognition: 1 theorem link
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
Pith reviewed 2026-05-15 02:44 UTC · model grok-4.3
The pith
Web agents based on large language models show some success on enterprise tasks but leave a large gap to full automation
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs.
What carries the argument
The WorkArena benchmark of 33 ServiceNow tasks, together with the BrowserGym environment, which supplies a rich action space and multimodal observations for agent design and evaluation
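The machinery above implies a gym-style interaction loop: the environment emits a goal plus observations of the page, and the agent emits a browser action. A minimal runnable sketch of that loop; every name here (WebEnv, the observation fields, the fill(...) action string) is an illustrative stand-in, not BrowserGym's actual API:

```python
# Toy sketch of a gym-style browser-environment loop, assuming a
# reset/step interface like the one BrowserGym is described as providing.
# All names and the action format are hypothetical stand-ins.
from dataclasses import dataclass, field


@dataclass
class WebEnv:
    """Stand-in for a browser environment: one form field to fill."""
    goal: str = "fill the 'priority' field with 'High'"
    form: dict = field(default_factory=dict)

    def reset(self):
        self.form = {"priority": ""}
        # A real environment would also return a screenshot / DOM / AXTree.
        return {"goal": self.goal, "dom": dict(self.form)}

    def step(self, action: str):
        # Actions are plain strings, e.g. "fill('priority', 'High')".
        if action.startswith("fill("):
            key, value = action[5:-1].replace("'", "").split(", ")
            self.form[key] = value
        obs = {"goal": self.goal, "dom": dict(self.form)}
        reward = 1.0 if self.form.get("priority") == "High" else 0.0
        return obs, reward, reward == 1.0


def scripted_agent(obs):
    # Stands in for an LLM that maps an observation to the next action.
    return "fill('priority', 'High')"


env = WebEnv()
obs = env.reset()
done, total = False, 0.0
while not done:
    obs, reward, done = env.step(scripted_agent(obs))
    total += reward
print(total)  # 1.0
```

The point of the sketch is the evaluation contract: success is a scalar reward computed by the environment from the final page state, which is what makes remote-hosted benchmarking of heterogeneous agents possible.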
If this is right
- Agents can manage some simpler browser-based tasks but fail on more involved ones.
- Closing the performance gap between open and closed-source models is a priority for progress.
- The benchmark can be used to track improvements in agent capabilities over time.
- Full automation of knowledge-work tasks is not yet feasible with existing methods.
Where Pith is reading between the lines
- Extending the benchmark to other enterprise platforms would test whether the observed gaps are specific to ServiceNow or more general.
- The results point to a need for better ways to handle long sequences of actions and complex interfaces in agent training.
- Access to proprietary model capabilities currently provides a practical advantage for deploying web agents in work settings.
Load-bearing premise
The 33 tasks chosen for WorkArena are representative of the typical daily work of knowledge workers utilizing enterprise software systems.
What would settle it
Demonstrating near-complete success rates on the 33 WorkArena tasks with new open-source models or on a wider set of real enterprise workflows would challenge the claim of a considerable remaining gap.
read the original abstract
We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WorkArena, a remote-hosted benchmark of 33 tasks drawn from the ServiceNow enterprise platform, together with the BrowserGym environment that supplies a rich action space and multimodal observations. Through empirical evaluation of LLM-based web agents, it claims that current agents show promise on these tasks yet exhibit a considerable gap to full automation, with a significant performance disparity between open- and closed-source models.
Significance. If the 33 tasks prove representative of typical knowledge-worker workflows, the benchmark and environment would supply a concrete, reproducible testbed for measuring progress toward practical web agents in enterprise settings. The reported open/closed-source gap would also constitute a falsifiable observation that could guide model development priorities.
major comments (1)
- [Task construction / benchmark definition] The central claim of a 'considerable gap towards achieving full task automation' and the open/closed-source disparity both rest on the assumption that the 33 ServiceNow tasks are representative of daily knowledge work. The manuscript describes the tasks as 'based on the widely-used ServiceNow platform' but supplies no usage-log frequency analysis, expert coverage survey, or cross-platform sampling to support this representativeness (see abstract and the task-construction description).
minor comments (2)
- [Evaluation] The abstract states that the evaluation 'reveals' the performance gap but does not specify success metrics, statistical tests, or controls; these details should be stated explicitly in the evaluation section.
- [BrowserGym environment] BrowserGym's action set and observation modalities are described at a high level; a table enumerating the exact actions and observation channels would improve reproducibility.
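The metric detail the minor comment asks for amounts to per-task success rates with uncertainty. A minimal sketch of one standard way such numbers could be reported; the Wilson interval and the illustrative counts below are our choices, not taken from the paper:

```python
# Hedged sketch: per-task success rate with a 95% Wilson score interval.
# The paper's actual metrics and statistics are not specified in the
# abstract; task names and counts here are illustrative only.
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial success rate."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z**2 / (4 * trials**2)
    )
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical results: (successes, seeds) per task.
results = {"create_incident": (7, 10), "search_knowledge_base": (3, 10)}
for task, (ok, n) in results.items():
    lo, hi = wilson_interval(ok, n)
    print(f"{task}: {ok}/{n} = {ok/n:.0%} (95% CI {lo:.0%} to {hi:.0%})")
```

With only 10 seeds per task the intervals are wide (7/10 spans roughly 40% to 89%), which is exactly why explicit uncertainty reporting matters for claims about a "considerable gap".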
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the concern regarding task representativeness below. While we cannot retroactively add proprietary usage logs, we can strengthen the justification for task selection and clarify the scope of our claims.
read point-by-point responses
-
Referee: The central claim of a 'considerable gap towards achieving full task automation' and the open/closed-source disparity both rest on the assumption that the 33 ServiceNow tasks are representative of daily knowledge work. The manuscript describes the tasks as 'based on the widely-used ServiceNow platform' but supplies no usage-log frequency analysis, expert coverage survey, or cross-platform sampling to support this representativeness (see abstract and the task-construction description).
Authors: We agree that a formal frequency analysis or expert survey would strengthen the claim of representativeness and note its absence as a limitation. The 33 tasks were curated by the authors (who have direct experience with ServiceNow deployments) to cover core, recurring operations in IT service management, HR, and knowledge workflows that are documented as standard in ServiceNow's own user guides and industry reports (e.g., incident creation, knowledge article search, user provisioning). ServiceNow is deployed in over 10,000 organizations and these task types align with publicly available case studies of daily knowledge-worker activity. The performance gap and open/closed-source disparity are reported as empirical observations on this benchmark rather than universal claims about all knowledge work; we will revise the abstract and task-construction section to explicitly frame the benchmark as a representative but non-exhaustive sample of enterprise web tasks and add references to ServiceNow documentation. No cross-platform sampling was performed because the benchmark deliberately targets a single widely-used platform to enable reproducible, remote-hosted evaluation. revision: partial
Circularity Check
No significant circularity in empirical benchmark evaluation
full rationale
This paper introduces the WorkArena benchmark and BrowserGym environment, then reports direct empirical performance measurements of LLM-based agents on 33 ServiceNow tasks. No mathematical derivations, equations, parameter fitting, or predictive claims exist that could reduce to inputs by construction. The central findings (performance gap to full automation and open/closed-source LLM disparity) rest on observed success rates rather than any self-definitional loop, fitted-input renaming, or load-bearing self-citation chain. Task selection and representativeness are stated assumptions open to external validation, but they do not create circularity within the reported results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 33 tasks in WorkArena represent typical daily knowledge work on enterprise platforms.
Forward citations
Cited by 22 Pith papers
-
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows
EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-speciali...
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
-
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
-
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
-
SAGE: A Service Agent Graph-guided Evaluation Benchmark
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...
-
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
-
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
-
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
-
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
-
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
-
LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training
LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
-
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
-
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
-
LLMs Corrupt Your Documents When You Delegate
LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
-
Agent Workflow Memory
AWM induces reusable workflows from agent experiences and provides them selectively to improve success rates by 24.6% on Mind2Web and 51.1% on WebArena while reducing steps taken.
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
-
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
-
ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation
Expanded orchestration in ChromaFlow lowered accuracy on GAIA tasks from 29/53 to 27/53 while increasing timeouts, tool failures, and costs.
Reference graph
Works this paper leans on
-
[1]
The unsolved challenges of LLMs in open-ended web tasks: A case study
Assouel, R., Marty, T., Caccia, M., Laradji, I., Drouin, A., Rajeswar, S., Palacios, H., Cappart, Q., Vazquez, D., Chapados, N., Gasse, M., and Lacoste, A. The unsolved challenges of LLMs in open-ended web tasks: A case study. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023. URL https://openreview.net/forum?id=jt3il4fC5B
work page 2023
-
[2]
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016
work page 2016
-
[3]
Mind2Web: Towards a generalist agent for the web
Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y. Mind2Web: Towards a generalist agent for the web. arXiv, abs/2306.06070, 2023
-
[4]
Furuta, H., Nachum, O., Lee, K.-H., Matsuo, Y., Gu, S. S., and Gur, I. Multimodal web navigation with instruction-finetuned foundation models. arXiv, abs/2305.11854, 2023. URL https://arxiv.org/abs/2305.11854
-
[5]
Chrome devtools protocol, 2023
Google . Chrome devtools protocol, 2023. URL https://chromedevtools.github.io/devtools-protocol/
work page 2023
-
[7]
Gur, I., Furuta, H., Huang, A., Safdari, M., Matsuo, Y., Eck, D., and Faust, A. A real-world WebAgent with planning, long context understanding, and program synthesis. arXiv, abs/2307.12856, 2023b. URL https://arxiv.org/abs/2307.12856
-
[8]
WebVoyager: Building an end-to-end web agent with large multimodal models
He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., Lan, Z., and Yu, D. WebVoyager: Building an end-to-end web agent with large multimodal models. arXiv, abs/2401.13919, 2024. URL https://arxiv.org/abs/2401.13919
-
[9]
Automatic macro mining from interaction traces at scale
Huang, F., Li, G., Li, T., and Li, Y. Automatic macro mining from interaction traces at scale. In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI '24, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400703300. doi:10.1145/3613904.3642074. URL https://doi.org/10.1145/3613904.3642074
-
[10]
Humphreys, P. C., Raposo, D., Pohlen, T., Thornton, G., Chhaparia, R., Muldal, A., Abramson, J., Georgiev, P., Santoro, A., and Lillicrap, T. A data-driven approach for learning to control computers. In International Conference on Machine Learning (ICML), 2022
work page 2022
-
[11]
Language models can solve computer tasks
Kim, G., Baldi, P., and McAleer, S. Language models can solve computer tasks. arXiv, abs/2303.17491, 2023. URL https://arxiv.org/abs/2303.17491
-
[12]
Mapping natural language instructions to mobile UI action sequences
Li, Y., He, J., Zhou, X., Zhang, Y., and Baldridge, J. Mapping natural language instructions to mobile UI action sequences. In Annual Conference of the Association for Computational Linguistics (ACL 2020), 2020. URL https://www.aclweb.org/anthology/2020.acl-main.729.pdf
work page 2020
-
[13]
Reinforcement learning on web interfaces using workflow-guided exploration
Liu, E. Z., Guu, K., Pasupat, P., Shi, T., and Liang, P. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations (ICLR), 2018
work page 2018
-
[14]
AgentBench: Evaluating LLMs as Agents
Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., Huang, M., Dong, Y., and Tang, J. AgentBench: Evaluating LLMs as agents. arXiv, abs/2308.03688, 2023a. URL https://arxiv.org/abs/2308.03688
work page 2023
-
[15]
BOLAA: Benchmarking and orchestrating LLM-augmented autonomous agents
Liu, Z., Yao, W., Zhang, J., Xue, L., Heinecke, S., Murthy, R., Feng, Y., Chen, Z., Niebles, J. C., Arpit, D., Xu, R., Mui, P., Wang, H., Xiong, C., and Savarese, S. BOLAA: Benchmarking and orchestrating LLM-augmented autonomous agents. arXiv, abs/2308.05960, 2023b
-
[16]
Lù, X. H., Kasner, Z., and Reddy, S. WebLINX: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024
-
[17]
Knowledge 2020: "The digital workflow revolution has just begun"
Maas, M. Knowledge 2020: "The digital workflow revolution has just begun". Technical report, Sprinklr, 2020. URL https://www.linkedin.com/pulse/knowledge-2020-digital-workflow-revolution-has-just-begun-maas/
work page 2020
-
[18]
ServiceNow joins the prestigious Fortune 500 list
Mastantuono, G. ServiceNow joins the prestigious Fortune 500 list. https://www.servicenow.com/blogs/2023/servicenow-joins-fortune-500-list.html, 2023. Accessed: 2024-01-29
work page 2023
-
[19]
Llama 3: Meta's latest large language model
Meta. Llama 3: Meta's latest large language model. https://github.com/meta-llama/llama3, 2024. Accessed: 2024-06-03
work page 2024
-
[20]
Playwright for Python documentation, 2023
Microsoft. Playwright for Python documentation, 2023. URL https://playwright.dev/python/
work page 2023
-
[21]
WebGPT: Browser-assisted question-answering with human feedback
Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button, K., Knight, M., Chess, B., and Schulman, J. WebGPT: Browser-assisted question-answering with human feedback. arXiv, abs/2112.09332, 2021. URL https://arxiv.org/abs/2112.09332
work page 2021
-
[22]
OpenAI. GPT-4 technical report. ArXiv, abs/2303.08774, 2023. URL https://arxiv.org/abs/2303.08774
work page 2023
-
[23]
AndroidInTheWild: A large-scale dataset for Android device control
Rawles, C., Li, A., Rodriguez, D., Riva, O., and Lillicrap, T. AndroidInTheWild: A large-scale dataset for Android device control. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 59708–59728, 2023
work page 2023
-
[24]
Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles
SAE. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles. Technical report, Society of Automotive Engineers (SAE), April 2021. URL https://doi.org/10.4271/J3016_202104
-
[25]
ServiceNow. Vancouver release notes. Online, 2023. Available at: https://docs.servicenow.com/bundle/vancouver-release-notes/
work page 2023
-
[26]
World of bits: An open-domain platform for web-based agents
Shi, T., Karpathy, A., Fan, L., Hernandez, J., and Liang, P. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning (ICML), 2017a
work page 2017
-
[27]
World of bits: An open-domain platform for web-based agents
Shi, T., Karpathy, A., Fan, L., Hernandez, J., and Liang, P. World of bits: An open-domain platform for web-based agents. ICML, 2017b
work page 2017
-
[28]
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton-Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Kh...
work page 2023
-
[29]
A journey into the future of the translation industry, 2021
van der Meer, J. A journey into the future of the translation industry, 2021. URL https://www.taus.net/resources/blog/a-journey-into-the-future-of-the-translation-industry. Accessed: 2024-02-01
work page 2021
-
[30]
Emergent Abilities of Large Language Models
Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a
work page 2022
-
[31]
Wei, J., Wang, X., Schuurmans, D., Bosma, M., ichter, b., Xia, F., Chi, E., Le, Q. V., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 24824–24837. Curran Associates, Inc...
work page 2022
-
[32]
Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., and Yu, T. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024
work page 2024
-
[33]
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Yang, J., Zhang, H., Li, F., Zou, X., Li, C., and Gao, J. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441, 2023
work page 2023
-
[34]
WebShop: Towards scalable real-world web interaction with grounded language agents
Yao, S., Chen, H., Yang, J., and Narasimhan, K. WebShop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[35]
ReAct: Synergizing Reasoning and Acting in Language Models
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. ReAct: Synergizing reasoning and acting in language models. arXiv, abs/2210.03629, 2023. URL https://arxiv.org/abs/2210.03629
work page 2023
-
[36]
AgentTuning: Enabling generalized agent abilities for LLMs
Zeng, A., Liu, M., Lu, R., Wang, B., Liu, X., Dong, Y., and Tang, J. AgentTuning: Enabling generalized agent abilities for LLMs. arXiv preprint arXiv:2310.12823, 2023
-
[37]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Bisk, Y., Fried, D., Alon, U., and Neubig, G. WebArena: A realistic web environment for building autonomous agents. arXiv, abs/2307.13854, 2023. URL https://arxiv.org/abs/2307.13854
work page 2023
discussion (0)