pith. machine review for the scientific record.

arxiv: 2306.06070 · v3 · submitted 2023-06-09 · 💻 cs.CL

Recognition: 2 Lean theorem links

Mind2Web: Towards a Generalist Agent for the Web

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords web agents · generalist agents · language instructions · dataset · large language models · web navigation · HTML filtering · instruction following
0 comments

The pith

Mind2Web supplies over 2,000 real-world tasks on 137 live websites so language models can act as generalist agents that follow instructions across unseen sites and domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Mind2Web to overcome the limits of prior web-agent datasets that rely on simulated sites or narrow coverage. It gathers open-ended tasks from 31 domains, records how crowd workers actually complete them on live pages, and supplies the raw HTML and action traces needed for training. Experiments show that large language models reach usable success rates when a smaller model first filters the oversized HTML input, and this holds for websites the models never encountered before. The result supplies the scale and realism required to move toward agents that handle arbitrary web tasks without hand-crafted simulators. Substantial gaps remain, but the dataset gives a concrete starting point for further progress.

Core claim

Mind2Web contains more than 2,000 open-ended tasks collected from 137 real websites spanning 31 domains together with crowdsourced sequences of user actions. The dataset records full HTML pages, element identifiers, and the precise clicks, types, and scrolls needed to finish each task. When large language models receive the raw HTML directly they struggle with length and noise, yet first passing the HTML through a small language model for filtering raises both effectiveness and speed. The same pipeline produces decent performance even on websites and entire domains held out during training.
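The combination of full HTML snapshots, element identifiers, and typed operations can be pictured as a structured record. The field names below are illustrative, not the dataset's actual schema; consult the released data for the real format.

```python
# Illustrative shape of one Mind2Web-style task record with an action trace.
# Field names and values are hypothetical, chosen only to show the structure:
# per action, a page snapshot, a target element, and an operation with a value.
task = {
    "website": "aa.com",
    "domain": "Travel",
    "confirmed_task": "Find one-way flights from New York to Toronto.",
    "actions": [
        {
            "raw_html": "<html>...</html>",  # full page snapshot before the action
            "target_element_id": "backend-node-152",
            "operation": {"op": "CLICK", "value": ""},
        },
        {
            "raw_html": "<html>...</html>",
            "target_element_id": "backend-node-87",
            "operation": {"op": "TYPE", "value": "New York"},
        },
    ],
}

# Sanity check: every operation is one of the basic interaction types.
assert all(a["operation"]["op"] in {"CLICK", "TYPE", "SELECT"} for a in task["actions"])
```

The key property is that each step carries the whole page, so a model must locate the target element inside realistically large HTML rather than a simplified simulator state.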

What carries the argument

The Mind2Web dataset of real-site tasks and action traces, paired with an LLM pipeline that first filters raw HTML via a smaller language model before generating actions.
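The two-stage pipeline can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `score_element` stands in for the small ranking LM (here, naive keyword overlap), and the final prompt would go to an LLM that is not modeled.

```python
# Minimal sketch of the two-stage idea: a small ranker prunes candidate
# elements from an oversized page so only a shortlist reaches the LLM prompt.

def score_element(task: str, element_html: str) -> float:
    """Hypothetical stand-in for a small fine-tuned LM that scores elements:
    naive word overlap between the task description and the element text."""
    task_words = set(task.lower().split())
    elem_words = set(element_html.lower().split())
    if not elem_words:
        return 0.0
    return len(task_words & elem_words) / len(elem_words)

def filter_candidates(task: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Keep only the top-k highest-scoring elements for the LLM prompt."""
    ranked = sorted(candidates, key=lambda e: score_element(task, e), reverse=True)
    return ranked[:top_k]

def build_prompt(task: str, candidates: list[str]) -> str:
    # A real pipeline would send this prompt to an LLM to pick the next action.
    shortlist = filter_candidates(task, candidates, top_k=2)
    return f"Task: {task}\nCandidate elements:\n" + "\n".join(shortlist)

candidates = [
    "<a> privacy policy </a>",
    "<button> search flights </button>",
    "<input placeholder='from city'>",
]
prompt = build_prompt("search one-way flights from new york", candidates)
```

The design point is that the expensive model never sees the raw page; the cheap ranker turns a page with thousands of nodes into a prompt of a few candidates, which is what makes the LLM both faster and more accurate.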

If this is right

  • Agents trained on the dataset can attempt open-ended tasks on websites never seen in training.
  • HTML filtering by a small language model makes large models both faster and more accurate on real pages.
  • The same data collection method can be repeated to expand coverage without building new simulators.
  • Performance gaps on complex interactions point to needed advances in long-horizon planning and element grounding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same filtering trick may transfer to other high-volume text environments such as mobile UIs or desktop applications.
  • Scaling the crowdsourcing process to thousands more sites would test whether current generalization limits are mainly data-size effects.
  • Combining Mind2Web traces with reinforcement learning on live sites could close the remaining performance gap without new human labels.

Load-bearing premise

The crowdsourced action sequences accurately reflect the steps a typical user would take on live websites, and the 137 sites capture enough variety to support generalization to new sites.

What would settle it

If models trained on Mind2Web achieved near-zero success rates on a fresh collection of websites drawn from domains outside the original 31, the claimed generalization would be refuted.

read the original abstract

We introduce Mind2Web, the first dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or only cover a limited set of websites and tasks, thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains and crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents: 1) diverse domains, websites, and tasks, 2) use of real-world websites instead of simulated and simplified ones, and 3) a broad spectrum of user interaction patterns. Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents. While the raw HTML of real-world websites are often too large to be fed to LLMs, we show that first filtering it with a small LM significantly improves the effectiveness and efficiency of LLMs. Our solution demonstrates a decent level of performance, even on websites or entire domains the model has never seen before, but there is still a substantial room to improve towards truly generalizable agents. We open-source our dataset, model implementation, and trained models (https://osu-nlp-group.github.io/Mind2Web) to facilitate further research on building a generalist agent for the web.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. Mind2Web introduces the first large-scale dataset for generalist web agents, consisting of over 2,000 open-ended tasks collected from 137 real-world websites spanning 31 domains, along with crowdsourced action sequences. The authors demonstrate that LLMs combined with HTML filtering by a small LM achieve decent performance even on unseen websites and domains, while noting substantial room for improvement, and release the dataset, code, and models.

Significance. This work supplies essential resources for generalist web agents by emphasizing real websites, task diversity, and broad interaction patterns, addressing gaps in prior simulated or narrow datasets. The open release of data and models, combined with the practical filtering baseline, supports reproducibility and further progress in web agent research.

minor comments (3)
  1. Abstract: the high-level claim of 'decent performance' on unseen websites/domains lacks specific metrics, baselines, or error analysis, which would better support the generalization results even if detailed numbers appear later in the paper.
  2. Dataset collection section: additional justification for how the 137 sites and 31 domains were chosen would help substantiate the claim that they provide sufficient diversity for generalization to arbitrary new sites.
  3. Evaluation: include a brief error analysis or breakdown of failure modes for the LLM+small-LM filtering approach on complex multi-step tasks to clarify where the 'substantial room to improve' lies.
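For context on what such an error analysis would score: evaluating a web agent step-by-step means checking, at each step, whether both the right element and the right operation were chosen. A simplified sketch of this framing (not the paper's exact metric implementation), with hypothetical element IDs:

```python
# Simplified step-level scoring in the spirit of Mind2Web-style evaluation:
# a step succeeds only when the selected element AND the operation (with its
# value, e.g. the typed text) both match the ground truth.

def step_success(pred: dict, gold: dict) -> bool:
    element_ok = pred["element"] == gold["element"]
    operation_ok = (pred["op"], pred.get("value", "")) == (gold["op"], gold.get("value", ""))
    return element_ok and operation_ok

# Toy trace: two predicted steps, only the first fully correct.
pred_steps = [
    {"element": "node-152", "op": "CLICK"},
    {"element": "node-87", "op": "TYPE", "value": "Toronto"},
]
gold_steps = [
    {"element": "node-152", "op": "CLICK"},
    {"element": "node-87", "op": "TYPE", "value": "New York"},
]

step_rate = sum(step_success(p, g) for p, g in zip(pred_steps, gold_steps)) / len(gold_steps)
# Task-level success is stricter: every step in the trace must succeed.
task_success = all(step_success(p, g) for p, g in zip(pred_steps, gold_steps))
```

The gap between per-step and whole-task numbers is exactly where a failure-mode breakdown on multi-step tasks would be most informative.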

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of Mind2Web, recognition of its significance as the first large-scale real-world dataset for generalist web agents, and recommendation of minor revision. We appreciate the emphasis on the dataset's diversity, use of actual websites, and open release of data, code, and models.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a new dataset Mind2Web by collecting open-ended tasks and crowdsourced action sequences from real websites, then empirically evaluates LLM-based agents with HTML filtering on held-out websites and domains never seen during training. No load-bearing step reduces a claimed prediction or result to its own inputs by construction, self-definition, or self-citation chain; performance numbers are measured on independent test splits rather than being statistically forced from fitted parameters within the same equations. The three 'necessary ingredients' are supplied directly by the dataset construction process itself and are not derived from prior results in a circular manner. Standard train/test separation on unseen sites ensures the central claims remain externally falsifiable.
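The train/test separation this audit leans on can be made concrete: entire websites, and entire domains, are held out so that no test page ever appears in training. A toy sketch with invented site names:

```python
# Sketch of cross-website / cross-domain evaluation splits: hold out whole
# websites (and whole domains) so test pages are never seen during training.
tasks = [
    {"id": 1, "website": "aa.com", "domain": "Travel"},
    {"id": 2, "website": "delta.com", "domain": "Travel"},
    {"id": 3, "website": "imdb.com", "domain": "Entertainment"},
    {"id": 4, "website": "yelp.com", "domain": "Food"},
]

held_out_sites = {"delta.com"}          # cross-website: site unseen, domain seen
held_out_domains = {"Entertainment"}    # cross-domain: the whole domain is unseen

train = [t for t in tasks
         if t["website"] not in held_out_sites and t["domain"] not in held_out_domains]
test_site = [t for t in tasks if t["website"] in held_out_sites]
test_domain = [t for t in tasks if t["domain"] in held_out_domains]
```

Because the held-out splits share no pages with training, performance on them measures generalization rather than memorization, which is what keeps the central claims externally falsifiable.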

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions about the reliability of crowdsourced human demonstrations and the ability of LLMs to follow instructions when given filtered HTML, but introduces no new fitted parameters or invented entities beyond existing LLM technology.

axioms (1)
  • Domain assumption: crowdsourced action sequences faithfully represent how humans complete the described tasks on live websites.
    Invoked when treating the collected sequences as ground-truth training and evaluation data for generalist agents.

pith-pipeline@v0.9.0 · 5570 in / 1359 out tokens · 84418 ms · 2026-05-15T20:02:09.262992+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  2. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.

  3. Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.

  4. Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

    cs.CL 2026-05 conditional novelty 7.0

    Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.

  5. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

  6. WAAA! Web Adversaries Against Agentic Browsers

    cs.CR 2026-05 unverdicted novelty 7.0

    Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.

  7. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  8. PlayCoder: Making LLM-Generated GUI Code Playable

    cs.SE 2026-04 conditional novelty 7.0

    PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.

  9. GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

    cs.LG 2026-04 conditional novelty 7.0

    GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.

  10. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    cs.LG 2024-03 unverdicted novelty 7.0

    WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.

  11. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

  12. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

    cs.SE 2026-04 unverdicted novelty 6.0

    Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

  13. Structured Distillation of Web Agent Capabilities Enables Generalization

    cs.LG 2026-04 unverdicted novelty 6.0

    Structured synthetic trajectory generation from Gemini 3 Pro enables a 9B open-weight model to reach 41.5% on WebArena, outperforming Claude 3.5 Sonnet and GPT-4o while generalizing to unseen enterprise environments.

  14. SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    cs.CR 2026-02 unverdicted novelty 6.0

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  15. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  16. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  17. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

    cs.CL 2024-03 conditional novelty 6.0

    InjecAgent benchmark demonstrates that tool-integrated LLM agents are vulnerable to indirect prompt injection attacks, with ReAct-prompted GPT-4 succeeding on 24% of attacks and nearly twice that rate when attacker in...

  18. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

    cs.CL 2024-01 unverdicted novelty 6.0

    WebVoyager uses a large multimodal model to complete real-world web tasks end-to-end and reaches 59.1 percent success on a new benchmark of 15 live sites, with an automatic GPT-4V evaluator that matches human judgment...

  19. GPT-4V(ision) is a Generalist Web Agent, if Grounded

    cs.IR 2024-01 conditional novelty 6.0

    GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.

  20. Tuning Qwen2.5-VL to Improve Its Web Interaction Skills

    cs.HC 2026-02 unverdicted novelty 5.0

    Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.

  21. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  22. Avenir-UX: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding

    cs.AI 2026-02 unverdicted novelty 4.0

    Avenir-UX automates web usability testing by using GUI-grounded simulation of user behavior to generate standardized reports with SUS, SEQ, and Think Aloud protocols.

  23. Understanding the planning of LLM agents: A survey

    cs.AI 2024-02 accept novelty 4.0

    A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 21 Pith papers · 12 internal anchors

  1. [1]

    https://github.com/puppeteer/puppeteer, 2021

    Puppeteer headless chrome node.js api. https://github.com/puppeteer/puppeteer, 2021

  2. [2]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-H...

  3. [3]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ B. Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Bryn- jolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris Donahue, Moussa Doumbouya, Es...

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

  5. [5]

    A dataset for interactive vision-language navigation with unknown command feasibility

    Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A. Plummer. A dataset for interactive vision-language navigation with unknown command feasibility. In European Conference on Computer Vision, 2022

  6. [6]

    Ohio supercomputer center, 1987

    Ohio Supercomputer Center. Ohio supercomputer center, 1987. URL http://osc.edu/ark:/19495/f5s1ph73

  7. [7]

    How many websites are there? How many are active in 2023?

    Radoslav Chakarov. How many websites are there? How many are active in 2023? https://webtribunal.net/blog/how-many-websites/, 2023

  8. [8]

    Reading Wikipedia to answer open-domain questions

    Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1171. URL https://acl...

  9. [9]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Br...

  10. [10]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jef...

  11. [11]

    Openagi: When LLM meets domain experts

    Yingqiang Ge, Wenyue Hua, Jianchao Ji, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. Openagi: When LLM meets domain experts. CoRR, abs/2304.04370, 2023. doi: 10.48550/arXiv.2304.04370. URL https://doi.org/10.48550/arXiv.2304.04370

  12. [12]

    Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases

    Yu Gu, Sue Kase, Michelle Vanni, Brian M. Sadler, Percy Liang, Xifeng Yan, and Yu Su. Beyond I.I.D.: three levels of generalization for question answering on knowledge bases. In Jure Leskovec, Marko Grobelnik, Marc Najork, Jie Tang, and Leila Zia, editors, WWW '21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021, pages 3477–...

  13. [13]

    Don’t generate, discriminate: A proposal for grounding language models to real-world environments

    Yu Gu, Xiang Deng, and Yu Su. Don't generate, discriminate: A proposal for grounding language models to real-world environments. CoRR, abs/2212.09736, 2022. doi: 10.48550/arXiv.2212.09736. URL https://doi.org/10.48550/arXiv.2212.09736

  14. [14]

    Understanding HTML with Large Language Models

    Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding HTML with large language models, 2023

  15. [15]

    ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings

    Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. CoRR, abs/2305.11554, 2023. doi: 10.48550/arXiv.2305.11554. URL https://doi.org/10.48550/arXiv.2305.11554

  16. [16]

    Deberta: decoding-enhanced bert with disentangled attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: decoding-enhanced bert with disentangled attention. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=XPZIaotutsD

  17. [17]

    StructGPT: A General Framework for Large Language Model to Reason over Structured Data

    Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. StructGPT: A general framework for large language model to reason over structured data. CoRR, abs/2305.09645, 2023. doi: 10.48550/arXiv.2305.09645. URL https://doi.org/10.48550/arXiv.2305.09645

  18. [18]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, November 2020. Association for Computational Lin- guistics....

  19. [19]

    CoScripter: Automating & Sharing How-to Knowledge in the Enterprise

    Gilly Leshed, Eben M. Haber, Tara Matthews, and Tessa Lau. CoScripter: Automating & sharing how-to knowledge in the enterprise. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1719–1728, Florence, Italy, April 2008. ACM. ISBN 978-1-60558-011-1. doi: 10.1145/1357054.1357323

  20. [20]

    API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

    Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A benchmark for tool-augmented llms. CoRR, abs/2304.08244, 2023. doi: 10.48550/arXiv.2304.08244. URL https://doi.org/10.48550/arXiv.2304.08244

  21. [21]

    Mapping natural language instructions to mobile UI action sequences

    Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile UI action sequences. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8198–8210. Asso...

  22. [22]

    Reinforcement learning on web interfaces using workflow-guided exploration

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/ forum...

  23. [23]

    Augmented Language Models: a Survey

    Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. Augmented language models: a survey. CoRR, abs/2302.07842, 2023. doi: 10.48550/arXiv.2302.07842. URL https://doi.org/10.48550/arXiv.2302.07842

  24. [24]

    Chatgpt plugins

    OpenAI. Chatgpt plugins. https://openai.com/blog/chatgpt-plugins. 2023

  25. [25]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  26. [26]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. CoRR, abs/2305.15334, 2023. doi: 10.48550/arXiv.2305.15334. URL https://doi.org/10.48550/arXiv.2305.15334

  27. [27]

    Tool learning with foundation models

    Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yux...

  28. [28]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL https://arxiv.org/abs/1908.10084

  29. [29]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. CoRR, abs/2302.04761, 2023. doi: 10.48550/arXiv.2302.04761. URL https://doi.org/10.48550/arXiv.2302.04761

  30. [30]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. CoRR, abs/2303.17580,

  31. [31]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    doi: 10.48550/arXiv.2303.17580. URL https://doi.org/10.48550/arXiv.2303.17580

  32. [32]

    World of Bits: An Open-Domain Platform for Web-Based Agents

    Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of Bits: An Open-Domain Platform for Web-Based Agents. In Proceedings of the 34th International Conference on Machine Learning, pages 3135–3144. PMLR, July 2017

  33. [33]

    ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 10737–10746. Computer Vis...

  34. [34]

    LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models

    Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, and Yu Su. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. CoRR, abs/2212.04088, 2022. doi: 10.48550/arXiv.2212.04088. URL https://doi.org/10.48550/arXiv.2212.04088

  35. [35]

    Building Natural Language Interfaces to Web APIs

    Yu Su, Ahmed Hassan Awadallah, Madian Khabsa, Patrick Pantel, Michael Gamon, and Mark J. Encarnación. Building natural language interfaces to web apis. In Ee-Peng Lim, Marianne Winslett, Mark Sanderson, Ada Wai-Chee Fu, Jimeng Sun, J. Shane Culpepper, Eric Lo, Joyce C. Ho, Debora Donato, Rakesh Agrawal, Yu Zheng, Carlos Castillo, Aixin Sun, Vincent S. Tse...

  36. [36]

    META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI, November 2022

    Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, and Kai Yu. META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI, November 2022

  37. [37]

    RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers

    Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567–7578, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.677

  38. [38]

    Automatic task completion flows from web apis

    Kyle Williams, Seyyed Hadi Hashemi, and Imed Zitouni. Automatic task completion flows from web apis. In Benjamin Piwowarski, Max Chevalier, Éric Gaussier, Yoelle Maarek, Jian-Yun Nie, and Falk Scholer, editors, Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 2...

  39. [39]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the- ar...

  40. [40]

    Grounding Open-Domain Instructions to Automate Web Support Tasks

    Nancy Xu, Sam Masling, Michael Du, Giovanni Campagna, Larry Heck, James A. Landay, and Monica S. Lam. Grounding open-domain instructions to automate web support tasks. In North American Chapter of the Association for Computational Linguistics, 2021

  41. [42]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. CoRR, abs/2210.03629, 2022. doi: 10.48550/arXiv.2210.03629. URL https://doi.org/10.48550/arXiv.2210.03629

  42. [43]

    Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base

    Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1321–1331...

  43. [44]

    Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

    Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedi...

  44. [45]

    URL https://doi.org/10.18653/v1/d18-1425

    doi: 10.18653/v1/d18-1425. URL https://doi.org/10.18653/v1/d18-1425

  45. [47]

    A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT

    Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu, Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu, and Lichao Sun. A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. CoRR, abs/2302.09419, 2023. doi: 10.48550/arXiv.2302.09...