pith. machine review for the scientific record.

arxiv: 2306.06070 · v3 · submitted 2023-06-09 · 💻 cs.CL

Recognition: 2 Lean theorem links

Mind2Web: Towards a Generalist Agent for the Web

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords web agents · generalist agents · language instructions · dataset · large language models · web navigation · HTML filtering · instruction following
0 comments

The pith

Mind2Web supplies over 2,000 real-world tasks on 137 live websites so language models can act as generalist agents that follow instructions across unseen sites and domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Mind2Web to overcome the limits of prior web-agent datasets that rely on simulated sites or narrow coverage. It gathers open-ended tasks from 31 domains, records how crowd workers actually complete them on live pages, and supplies the raw HTML and action traces needed for training. Experiments show that large language models reach usable success rates when a smaller model first filters the oversized HTML input, and this holds for websites the models never encountered before. The result supplies the scale and realism required to move toward agents that handle arbitrary web tasks without hand-crafted simulators. Substantial gaps remain, but the dataset gives a concrete starting point for further progress.

Core claim

Mind2Web contains more than 2,000 open-ended tasks collected from 137 real websites spanning 31 domains together with crowdsourced sequences of user actions. The dataset records full HTML pages, element identifiers, and the precise clicks, types, and scrolls needed to finish each task. When large language models receive the raw HTML directly they struggle with length and noise, yet first passing the HTML through a small language model for filtering raises both effectiveness and speed. The same pipeline produces decent performance even on websites and entire domains held out during training.
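The combination of full HTML snapshots, element identifiers, and typed operations can be pictured as a structured record. The field names below are illustrative, not the dataset's actual schema; consult the released data for the real format.

```python
# Illustrative shape of one Mind2Web-style task record with an action trace.
# Field names and values are hypothetical, chosen only to show the structure:
# per action, a page snapshot, a target element, and an operation with a value.
task = {
    "website": "aa.com",
    "domain": "Travel",
    "confirmed_task": "Find one-way flights from New York to Toronto.",
    "actions": [
        {
            "raw_html": "<html>...</html>",  # full page snapshot before the action
            "target_element_id": "backend-node-152",
            "operation": {"op": "CLICK", "value": ""},
        },
        {
            "raw_html": "<html>...</html>",
            "target_element_id": "backend-node-87",
            "operation": {"op": "TYPE", "value": "New York"},
        },
    ],
}

# Sanity check: every operation is one of the basic interaction types.
assert all(a["operation"]["op"] in {"CLICK", "TYPE", "SELECT"} for a in task["actions"])
```

The key property is that each step carries the whole page, so a model must locate the target element inside realistically large HTML rather than a simplified simulator state.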

What carries the argument

The Mind2Web dataset of real-site tasks and action traces, paired with an LLM pipeline that first filters raw HTML via a smaller language model before generating actions.
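The two-stage pipeline can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `score_element` stands in for the small ranking LM (here, naive keyword overlap), and the final prompt would go to an LLM that is not modeled.

```python
# Minimal sketch of the two-stage idea: a small ranker prunes candidate
# elements from an oversized page so only a shortlist reaches the LLM prompt.

def score_element(task: str, element_html: str) -> float:
    """Hypothetical stand-in for a small fine-tuned LM that scores elements:
    naive word overlap between the task description and the element text."""
    task_words = set(task.lower().split())
    elem_words = set(element_html.lower().split())
    if not elem_words:
        return 0.0
    return len(task_words & elem_words) / len(elem_words)

def filter_candidates(task: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Keep only the top-k highest-scoring elements for the LLM prompt."""
    ranked = sorted(candidates, key=lambda e: score_element(task, e), reverse=True)
    return ranked[:top_k]

def build_prompt(task: str, candidates: list[str]) -> str:
    # A real pipeline would send this prompt to an LLM to pick the next action.
    shortlist = filter_candidates(task, candidates, top_k=2)
    return f"Task: {task}\nCandidate elements:\n" + "\n".join(shortlist)

candidates = [
    "<a> privacy policy </a>",
    "<button> search flights </button>",
    "<input placeholder='from city'>",
]
prompt = build_prompt("search one-way flights from new york", candidates)
```

The design point is that the expensive model never sees the raw page; the cheap ranker turns a page with thousands of nodes into a prompt of a few candidates, which is what makes the LLM both faster and more accurate.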

If this is right

  • Agents trained on the dataset can attempt open-ended tasks on websites never seen in training.
  • HTML filtering by a small language model makes large models both faster and more accurate on real pages.
  • The same data collection method can be repeated to expand coverage without building new simulators.
  • Performance gaps on complex interactions point to needed advances in long-horizon planning and element grounding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same filtering trick may transfer to other high-volume text environments such as mobile UIs or desktop applications.
  • Scaling the crowdsourcing process to thousands more sites would test whether current generalization limits are mainly data-size effects.
  • Combining Mind2Web traces with reinforcement learning on live sites could close the remaining performance gap without new human labels.

Load-bearing premise

The crowdsourced action sequences accurately reflect the steps a typical user would take on live websites, and the 137 sites capture enough variety to support generalization to new sites.

What would settle it

If models trained on Mind2Web achieved near-zero success rates on a fresh collection of websites drawn from domains outside the original 31, the claimed generalization would be refuted.

read the original abstract

We introduce Mind2Web, the first dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or only cover a limited set of websites and tasks, thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains and crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents: 1) diverse domains, websites, and tasks, 2) use of real-world websites instead of simulated and simplified ones, and 3) a broad spectrum of user interaction patterns. Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents. While the raw HTML of real-world websites are often too large to be fed to LLMs, we show that first filtering it with a small LM significantly improves the effectiveness and efficiency of LLMs. Our solution demonstrates a decent level of performance, even on websites or entire domains the model has never seen before, but there is still a substantial room to improve towards truly generalizable agents. We open-source our dataset, model implementation, and trained models (https://osu-nlp-group.github.io/Mind2Web) to facilitate further research on building a generalist agent for the web.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. Mind2Web introduces the first large-scale dataset for generalist web agents, consisting of over 2,000 open-ended tasks collected from 137 real-world websites spanning 31 domains, along with crowdsourced action sequences. The authors demonstrate that LLMs combined with HTML filtering by a small LM achieve decent performance even on unseen websites and domains, while noting substantial room for improvement, and release the dataset, code, and models.

Significance. This work supplies essential resources for generalist web agents by emphasizing real websites, task diversity, and broad interaction patterns, addressing gaps in prior simulated or narrow datasets. The open release of data and models, combined with the practical filtering baseline, supports reproducibility and further progress in web agent research.

minor comments (3)
  1. Abstract: the high-level claim of 'decent performance' on unseen websites/domains lacks specific metrics, baselines, or error analysis, which would better support the generalization results even if detailed numbers appear later in the paper.
  2. Dataset collection section: additional justification for how the 137 sites and 31 domains were chosen would help substantiate the claim that they provide sufficient diversity for generalization to arbitrary new sites.
  3. Evaluation: include a brief error analysis or breakdown of failure modes for the LLM+small-LM filtering approach on complex multi-step tasks to clarify where the 'substantial room to improve' lies.
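For context on what such an error analysis would score: evaluating a web agent step-by-step means checking, at each step, whether both the right element and the right operation were chosen. A simplified sketch of this framing (not the paper's exact metric implementation), with hypothetical element IDs:

```python
# Simplified step-level scoring in the spirit of Mind2Web-style evaluation:
# a step succeeds only when the selected element AND the operation (with its
# value, e.g. the typed text) both match the ground truth.

def step_success(pred: dict, gold: dict) -> bool:
    element_ok = pred["element"] == gold["element"]
    operation_ok = (pred["op"], pred.get("value", "")) == (gold["op"], gold.get("value", ""))
    return element_ok and operation_ok

# Toy trace: two predicted steps, only the first fully correct.
pred_steps = [
    {"element": "node-152", "op": "CLICK"},
    {"element": "node-87", "op": "TYPE", "value": "Toronto"},
]
gold_steps = [
    {"element": "node-152", "op": "CLICK"},
    {"element": "node-87", "op": "TYPE", "value": "New York"},
]

step_rate = sum(step_success(p, g) for p, g in zip(pred_steps, gold_steps)) / len(gold_steps)
# Task-level success is stricter: every step in the trace must succeed.
task_success = all(step_success(p, g) for p, g in zip(pred_steps, gold_steps))
```

The gap between per-step and whole-task numbers is exactly where a failure-mode breakdown on multi-step tasks would be most informative.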

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of Mind2Web, recognition of its significance as the first large-scale real-world dataset for generalist web agents, and recommendation of minor revision. We appreciate the emphasis on the dataset's diversity, use of actual websites, and open release of data, code, and models.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a new dataset Mind2Web by collecting open-ended tasks and crowdsourced action sequences from real websites, then empirically evaluates LLM-based agents with HTML filtering on held-out websites and domains never seen during training. No load-bearing step reduces a claimed prediction or result to its own inputs by construction, self-definition, or self-citation chain; performance numbers are measured on independent test splits rather than being statistically forced from fitted parameters within the same equations. The three 'necessary ingredients' are supplied directly by the dataset construction process itself and are not derived from prior results in a circular manner. Standard train/test separation on unseen sites ensures the central claims remain externally falsifiable.
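The train/test separation this audit leans on can be made concrete: entire websites, and entire domains, are held out so that no test page ever appears in training. A toy sketch with invented site names:

```python
# Sketch of cross-website / cross-domain evaluation splits: hold out whole
# websites (and whole domains) so test pages are never seen during training.
tasks = [
    {"id": 1, "website": "aa.com", "domain": "Travel"},
    {"id": 2, "website": "delta.com", "domain": "Travel"},
    {"id": 3, "website": "imdb.com", "domain": "Entertainment"},
    {"id": 4, "website": "yelp.com", "domain": "Food"},
]

held_out_sites = {"delta.com"}          # cross-website: site unseen, domain seen
held_out_domains = {"Entertainment"}    # cross-domain: the whole domain is unseen

train = [t for t in tasks
         if t["website"] not in held_out_sites and t["domain"] not in held_out_domains]
test_site = [t for t in tasks if t["website"] in held_out_sites]
test_domain = [t for t in tasks if t["domain"] in held_out_domains]
```

Because the held-out splits share no pages with training, performance on them measures generalization rather than memorization, which is what keeps the central claims externally falsifiable.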

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions about the reliability of crowdsourced human demonstrations and the ability of LLMs to follow instructions when given filtered HTML, but introduces no new fitted parameters or invented entities beyond existing LLM technology.

axioms (1)
  • Domain assumption: crowdsourced action sequences faithfully represent how humans complete the described tasks on live websites.
    Invoked when treating the collected sequences as ground-truth training and evaluation data for generalist agents.

pith-pipeline@v0.9.0 · 5570 in / 1359 out tokens · 84418 ms · 2026-05-15T20:02:09.262992+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  2. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.

  3. Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.

  4. Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

    cs.CL 2026-05 conditional novelty 7.0

    Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.

  5. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

  6. WAAA! Web Adversaries Against Agentic Browsers

    cs.CR 2026-05 unverdicted novelty 7.0

    Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.

  7. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  8. PlayCoder: Making LLM-Generated GUI Code Playable

    cs.SE 2026-04 conditional novelty 7.0

    PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.

  9. GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

    cs.LG 2026-04 conditional novelty 7.0

    GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.

  10. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    cs.LG 2024-03 unverdicted novelty 7.0

    WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.

  11. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

  12. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

    cs.SE 2026-04 unverdicted novelty 6.0

    Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

  13. Structured Distillation of Web Agent Capabilities Enables Generalization

    cs.LG 2026-04 unverdicted novelty 6.0

    Structured synthetic trajectory generation from Gemini 3 Pro enables a 9B open-weight model to reach 41.5% on WebArena, outperforming Claude 3.5 Sonnet and GPT-4o while generalizing to unseen enterprise environments.

  14. SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    cs.CR 2026-02 unverdicted novelty 6.0

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  15. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  16. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  17. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

    cs.CL 2024-03 conditional novelty 6.0

    InjecAgent benchmark demonstrates that tool-integrated LLM agents are vulnerable to indirect prompt injection attacks, with ReAct-prompted GPT-4 succeeding on 24% of attacks and nearly twice that rate when attacker in...

  18. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

    cs.CL 2024-01 unverdicted novelty 6.0

    WebVoyager uses a large multimodal model to complete real-world web tasks end-to-end and reaches 59.1 percent success on a new benchmark of 15 live sites, with an automatic GPT-4V evaluator that matches human judgment...

  19. GPT-4V(ision) is a Generalist Web Agent, if Grounded

    cs.IR 2024-01 conditional novelty 6.0

    GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.

  20. Tuning Qwen2.5-VL to Improve Its Web Interaction Skills

    cs.HC 2026-02 unverdicted novelty 5.0

    Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.

  21. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  22. Avenir-UX: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding

    cs.AI 2026-02 unverdicted novelty 4.0

    Avenir-UX automates web usability testing by using GUI-grounded simulation of user behavior to generate standardized reports with SUS, SEQ, and Think Aloud protocols.

  23. Understanding the planning of LLM agents: A survey

    cs.AI 2024-02 accept novelty 4.0

    A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 21 Pith papers · 12 internal anchors

  1. [1]

    https://github.com/puppeteer/puppeteer, 2021

    Puppeteer headless chrome node.js api. https://github.com/puppeteer/puppeteer, 2021

  2. [2]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-H...

  3. [3]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ B. Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Bryn- jolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris Donahue, Moussa Doumbouya, Es...

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

  5. [5]

    A dataset for interactive vision-language navigation with unknown command feasibility

    Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A. Plummer. A dataset for interactive vision-language navigation with unknown command feasibility. In European Conference on Computer Vision, 2022

  6. [6]

    Ohio supercomputer center, 1987

    Ohio Supercomputer Center. Ohio supercomputer center, 1987. URL http://osc.edu/ark:/19495/f5s1ph73

  7. [7]

    How many websites are there? How many are active in 2023?

    Radoslav Chakarov. How many websites are there? How many are active in 2023? https://webtribunal.net/blog/how-many-websites/, 2023

  8. [8]

    Reading Wikipedia to answer open-domain questions

    Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1171. URL https://acl...

  9. [9]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Br...

  10. [10]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jef...

  11. [11]

    Openagi: When LLM meets domain experts

    Yingqiang Ge, Wenyue Hua, Jianchao Ji, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. Openagi: When LLM meets domain experts. CoRR, abs/2304.04370, 2023. doi: 10.48550/arXiv.2304.04370. URL https://doi.org/10.48550/arXiv.2304.04370

  12. [12]

    Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases

    Yu Gu, Sue Kase, Michelle Vanni, Brian M. Sadler, Percy Liang, Xifeng Yan, and Yu Su. Beyond I.I.D.: three levels of generalization for question answering on knowledge bases. In Jure Leskovec, Marko Grobelnik, Marc Najork, Jie Tang, and Leila Zia, editors, WWW '21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021, pages 3477–...

  13. [13]

    Don’t generate, discriminate: A proposal for grounding language models to real-world environments

    Yu Gu, Xiang Deng, and Yu Su. Don't generate, discriminate: A proposal for grounding language models to real-world environments. CoRR, abs/2212.09736, 2022. doi: 10.48550/arXiv.2212.09736. URL https://doi.org/10.48550/arXiv.2212.09736

  14. [14]

    Understanding HTML with Large Language Models

    Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding HTML with large language models, 2023

  15. [15]

    ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings

    Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. CoRR, abs/2305.11554, 2023. doi: 10.48550/arXiv.2305.11554. URL https://doi.org/10.48550/arXiv.2305.11554

  16. [16]

    Deberta: decoding-enhanced bert with disentangled attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: decoding-enhanced bert with disentangled attention. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=XPZIaotutsD

  17. [17]

    StructGPT: A General Framework for Large Language Model to Reason over Structured Data

    Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. StructGPT: A general framework for large language model to reason over structured data. CoRR, abs/2305.09645, 2023. doi: 10.48550/arXiv.2305.09645. URL https://doi.org/10.48550/arXiv.2305.09645

  18. [18]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, November 2020. Association for Computational Lin- guistics....

  19. [19]

    CoScripter: Automating & Sharing How-to Knowledge in the Enterprise

    Gilly Leshed, Eben M. Haber, Tara Matthews, and Tessa Lau. CoScripter: Automating & sharing how-to knowledge in the enterprise. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1719–1728, Florence, Italy, April 2008. ACM. ISBN 978-1-60558-011-1. doi: 10.1145/1357054.1357323

  20. [20]

    API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

    Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A benchmark for tool-augmented llms. CoRR, abs/2304.08244, 2023. doi: 10.48550/arXiv.2304.08244. URL https://doi.org/10.48550/arXiv.2304.08244

  21. [21]

    Mapping natural language instructions to mobile UI action sequences

    Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile UI action sequences. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8198–8210. Asso...

  22. [22]

    Reinforcement learning on web interfaces using workflow-guided exploration

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/ forum...

  23. [23]

    Augmented Language Models: a Survey

    Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. Augmented language models: a survey. CoRR, abs/2302.07842, 2023. doi: 10.48550/arXiv.2302.07842. URL https://doi.org/10.48550/arXiv.2302.07842

  24. [24]

    Chatgpt plugins

    OpenAI. Chatgpt plugins. https://openai.com/blog/chatgpt-plugins. 2023

  25. [25]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  26. [26]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. CoRR, abs/2305.15334, 2023. doi: 10.48550/arXiv.2305.15334. URL https://doi.org/10.48550/arXiv.2305.15334

  27. [27]

    Tool learning with foundation models

    Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yux...

  28. [28]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL https://arxiv.org/abs/1908.10084

  29. [29]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. CoRR, abs/2302.04761, 2023. doi: 10.48550/arXiv.2302.04761. URL https://doi.org/10.48550/arXiv.2302.04761

  30. [30]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. CoRR, abs/2303.17580,

  31. [31]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    doi: 10.48550/arXiv.2303.17580. URL https://doi.org/10.48550/arXiv.2303.17580

  32. [32]

    World of Bits: An Open-Domain Platform for Web-Based Agents

    Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of Bits: An Open-Domain Platform for Web-Based Agents. In Proceedings of the 34th International Conference on Machine Learning, pages 3135–3144. PMLR, July 2017

  33. [33]

    ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 10737–10746. Computer Vis...

  34. [34]

    LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models

    Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, and Yu Su. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. CoRR, abs/2212.04088, 2022. doi: 10.48550/arXiv.2212.04088. URL https://doi.org/10.48550/arXiv.2212.04088

  35. [35]

    Building Natural Language Interfaces to Web APIs

    Yu Su, Ahmed Hassan Awadallah, Madian Khabsa, Patrick Pantel, Michael Gamon, and Mark J. Encarnación. Building natural language interfaces to web apis. In Ee-Peng Lim, Marianne Winslett, Mark Sanderson, Ada Wai-Chee Fu, Jimeng Sun, J. Shane Culpepper, Eric Lo, Joyce C. Ho, Debora Donato, Rakesh Agrawal, Yu Zheng, Carlos Castillo, Aixin Sun, Vincent S. Tse...

  36. [36]

    META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI, November 2022

    Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, and Kai Yu. META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI, November 2022

  37. [37]

    RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers

    Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567–7578, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.677

  38. [38]

    Automatic task completion flows from web apis

    Kyle Williams, Seyyed Hadi Hashemi, and Imed Zitouni. Automatic task completion flows from web apis. In Benjamin Piwowarski, Max Chevalier, Éric Gaussier, Yoelle Maarek, Jian-Yun Nie, and Falk Scholer, editors, Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 2...

  39. [39]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the- ar...

  40. [40]

    Grounding Open-Domain Instructions to Automate Web Support Tasks

    Nancy Xu, Sam Masling, Michael Du, Giovanni Campagna, Larry Heck, James A. Landay, and Monica S. Lam. Grounding open-domain instructions to automate web support tasks. In North American Chapter of the Association for Computational Linguistics, 2021

  41. [42]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. CoRR, abs/2210.03629, 2022. doi: 10.48550/arXiv.2210.03629. URL https://doi.org/10.48550/arXiv.2210.03629

  42. [43]

    Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base

    Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1321–1331...

  43. [44]

    Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

    Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedi...

  44. [45]

    URL https://doi.org/10.18653/v1/d18-1425

    doi: 10.18653/v1/d18-1425. URL https://doi.org/10.18653/v1/d18-1425

  45. [47]

    A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT

    Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu, Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu, and Lichao Sun. A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. CoRR, abs/2302.09419, 2023. doi: 10.48550/arXiv.2302.09...