Mind2Web: Towards a Generalist Agent for the Web
Pith reviewed 2026-05-15 20:02 UTC · model grok-4.3
The pith
Mind2Web supplies over 2,000 real-world tasks on 137 live websites so that language models can act as generalist agents, following instructions across unseen sites and domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mind2Web contains more than 2,000 open-ended tasks collected from 137 real websites spanning 31 domains together with crowdsourced sequences of user actions. The dataset records full HTML pages, element identifiers, and the precise clicks, types, and scrolls needed to finish each task. When large language models receive the raw HTML directly they struggle with length and noise, yet first passing the HTML through a small language model for filtering raises both effectiveness and speed. The same pipeline produces decent performance even on websites and entire domains held out during training.
What carries the argument
The Mind2Web dataset of real-site tasks and action traces, paired with an LLM pipeline that first filters raw HTML via a smaller language model before generating actions.
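A minimal sketch of how such a two-stage pipeline can be wired up. The keyword-overlap ranker and the element records below are illustrative stand-ins, not the paper's fine-tuned small LM; only the shape of the pipeline (rank, filter to top-k, then pose a multiple-choice question to the large model) follows the description above.

```python
def rank_elements(task, elements, top_k=5):
    """Stage 1: score candidate DOM elements against the task and keep top_k.
    A toy keyword-overlap scorer stands in for the small-LM ranker."""
    def score(el):
        task_words = set(task.lower().split())
        elem_words = set(el["text"].lower().split())
        return len(task_words & elem_words)
    return sorted(elements, key=score, reverse=True)[:top_k]

def build_multichoice_prompt(task, candidates):
    """Stage 2: format the filtered candidates as a multiple-choice question
    for the large model; option A is always 'None of the above'."""
    options = ["A. None of the above"] + [
        f"{chr(ord('B') + i)}. <{el['tag']}> {el['text']}"
        for i, el in enumerate(candidates)
    ]
    return f"Task: {task}\n" + "\n".join(options)

if __name__ == "__main__":
    elements = [
        {"tag": "a", "text": "Home"},
        {"tag": "button", "text": "Search flights"},
        {"tag": "input", "text": "Departure city"},
    ]
    task = "Search for flights from Columbus"
    kept = rank_elements(task, elements, top_k=2)
    print(build_multichoice_prompt(task, kept))
```

The multiple-choice framing matters because it turns open-ended HTML generation into discrimination over a short, pre-filtered candidate list, which is what makes long, noisy real-world pages tractable for the large model.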
If this is right
- Agents trained on the dataset can attempt open-ended tasks on websites never seen in training.
- HTML filtering by a small language model makes large models both faster and more accurate on real pages.
- The same data collection method can be repeated to expand coverage without building new simulators.
- Performance gaps on complex interactions point to needed advances in long-horizon planning and element grounding.
Where Pith is reading between the lines
- The same filtering trick may transfer to other high-volume text environments such as mobile UIs or desktop applications.
- Scaling the crowdsourcing process to thousands more sites would test whether current generalization limits are mainly data-size effects.
- Combining Mind2Web traces with reinforcement learning on live sites could close the remaining performance gap without new human labels.
Load-bearing premise
The crowdsourced action sequences accurately reflect the steps a typical user would take on live websites, and the 137 sites capture enough variety to support generalization to new sites.
What would settle it
Models trained on Mind2Web achieve near-zero success rates when tested on a fresh collection of websites drawn from domains outside the original 31.
Read the original abstract
We introduce Mind2Web, the first dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or only cover a limited set of websites and tasks, thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains and crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents: 1) diverse domains, websites, and tasks, 2) use of real-world websites instead of simulated and simplified ones, and 3) a broad spectrum of user interaction patterns. Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents. While the raw HTML of real-world websites are often too large to be fed to LLMs, we show that first filtering it with a small LM significantly improves the effectiveness and efficiency of LLMs. Our solution demonstrates a decent level of performance, even on websites or entire domains the model has never seen before, but there is still a substantial room to improve towards truly generalizable agents. We open-source our dataset, model implementation, and trained models (https://osu-nlp-group.github.io/Mind2Web) to facilitate further research on building a generalist agent for the web.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. Mind2Web introduces the first large-scale dataset for generalist web agents, consisting of over 2,000 open-ended tasks collected from 137 real-world websites spanning 31 domains, along with crowdsourced action sequences. The authors demonstrate that LLMs combined with HTML filtering by a small LM achieve decent performance even on unseen websites and domains, while noting substantial room for improvement, and release the dataset, code, and models.
Significance. This work supplies essential resources for generalist web agents by emphasizing real websites, task diversity, and broad interaction patterns, addressing gaps in prior simulated or narrow datasets. The open release of data and models, combined with the practical filtering baseline, supports reproducibility and further progress in web agent research.
Minor comments (3)
- Abstract: the high-level claim of 'decent performance' on unseen websites/domains lacks specific metrics, baselines, or error analysis, which would better support the generalization results even if detailed numbers appear later in the paper.
- Dataset collection section: additional justification for how the 137 sites and 31 domains were chosen would help substantiate the claim that they provide sufficient diversity for generalization to arbitrary new sites.
- Evaluation: include a brief error analysis or breakdown of failure modes for the LLM+small-LM filtering approach on complex multi-step tasks to clarify where the 'substantial room to improve' lies.
Simulated Author's Rebuttal
We thank the referee for their positive summary of Mind2Web, recognition of its significance as the first large-scale real-world dataset for generalist web agents, and recommendation of minor revision. We appreciate the emphasis on the dataset's diversity, use of actual websites, and open release of data, code, and models.
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces a new dataset Mind2Web by collecting open-ended tasks and crowdsourced action sequences from real websites, then empirically evaluates LLM-based agents with HTML filtering on held-out websites and domains never seen during training. No load-bearing step reduces a claimed prediction or result to its own inputs by construction, self-definition, or self-citation chain; performance numbers are measured on independent test splits rather than being statistically forced from fitted parameters within the same equations. The three 'necessary ingredients' are supplied directly by the dataset construction process itself and are not derived from prior results in a circular manner. Standard train/test separation on unseen sites ensures the central claims remain externally falsifiable.
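The held-out evaluation this rationale relies on can be sketched as a split keyed on website. The task records below are hypothetical, but the invariant is the one described above: no task from a held-out website ever appears in training, so the generalization claim stays externally testable.

```python
def cross_website_split(tasks, held_out_sites):
    """Partition task records so every task from a held-out website lands in
    the test split; the model trains only on the remaining sites."""
    train = [t for t in tasks if t["website"] not in held_out_sites]
    test = [t for t in tasks if t["website"] in held_out_sites]
    return train, test

if __name__ == "__main__":
    # Hypothetical records; real splits would be built from dataset metadata.
    tasks = [
        {"task": "book a flight", "website": "aa.com", "domain": "travel"},
        {"task": "find a movie", "website": "imdb.com", "domain": "entertainment"},
        {"task": "rent a car", "website": "budget.com", "domain": "travel"},
    ]
    train, test = cross_website_split(tasks, {"imdb.com"})
    # The two splits must share no websites.
    assert not {t["website"] for t in train} & {t["website"] for t in test}
```

The same function keyed on `domain` instead of `website` gives the stricter cross-domain split, where entire categories of sites are withheld.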
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Crowdsourced action sequences faithfully represent how humans complete the described tasks on live websites.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tagged: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We introduce Mind2Web, the first dataset for developing and evaluating generalist agents for the web... two-stage model that involves first using a fine-tuned small LM to filter the web elements and then using an LLM to select from the filtered elements in a multi-choice question answering fashion"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "MINDACT significantly outperforms modeling strategies... achieves a decent level of generalization... on websites or entire domains the model has never seen before"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
  OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
- MMSkills: Towards Multimodal Skills for General Visual Agents
  MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.
- Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
  Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.
- Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
  Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
- Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
  Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
- WAAA! Web Adversaries Against Agentic Browsers
  Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.
- OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
  OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
- PlayCoder: Making LLM-Generated GUI Code Playable
  PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
  GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
  WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
- MMSkills: Towards Multimodal Skills for General Visual Agents
  MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
- Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
  Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
- Structured Distillation of Web Agent Capabilities Enables Generalization
  Structured synthetic trajectory generation from Gemini 3 Pro enables a 9B open-weight model to reach 41.5% on WebArena, outperforming Claude 3.5 Sonnet and GPT-4o while generalizing to unseen enterprise environments.
- SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
  The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
  LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
- InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
  InjecAgent benchmark demonstrates that tool-integrated LLM agents are vulnerable to indirect prompt injection attacks, with ReAct-prompted GPT-4 succeeding on 24% of attacks and nearly twice that rate when attacker in...
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
  WebVoyager uses a large multimodal model to complete real-world web tasks end-to-end and reaches 59.1 percent success on a new benchmark of 15 live sites, with an automatic GPT-4V evaluator that matches human judgment...
- GPT-4V(ision) is a Generalist Web Agent, if Grounded
  GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.
- Tuning Qwen2.5-VL to Improve Its Web Interaction Skills
  Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
  UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
- Avenir-UX: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding
  Avenir-UX automates web usability testing by using GUI-grounded simulation of user behavior to generate standardized reports with SUS, SEQ, and Think Aloud protocols.
- Understanding the planning of LLM agents: A survey
  A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
Reference graph
Works this paper leans on
-
[1]
https://github.com/puppeteer/puppeteer, 2021
Puppeteer headless chrome node.js api. https://github.com/puppeteer/puppeteer, 2021
work page 2021
-
[2]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-H...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2204.01691 2022
-
[3]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ B. Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris Donahue, Moussa Doumbouya, Es...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[4]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...
work page 2020
- [5]
-
[6]
Ohio supercomputer center, 1987
Ohio Supercomputer Center. Ohio supercomputer center, 1987. URL http://osc.edu/ark:/19495/f5s1ph73
work page 1987
-
[7]
Radoslav Chakarov. How many websites are there? How many are active in 2023? https://webtribunal.net/blog/how-many-websites/. 2023
work page 2023
-
[8]
Reading Wikipedia to answer open-domain questions
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1171. URL https://acl...
-
[9]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, 11 James Br...
work page 2023
-
[10]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y . Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jef...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.11416 2022
-
[11]
Openagi: When LLM meets domain experts
Yingqiang Ge, Wenyue Hua, Jianchao Ji, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. Openagi: When LLM meets domain experts. CoRR, abs/2304.04370, 2023. doi: 10.48550/arXiv.2304.04370. URL https://doi.org/10.48550/arXiv.2304.04370
-
[12]
Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases
Yu Gu, Sue Kase, Michelle Vanni, Brian M. Sadler, Percy Liang, Xifeng Yan, and Yu Su. Beyond I.I.D.: three levels of generalization for question answering on knowledge bases. In Jure Leskovec, Marko Grobelnik, Marc Najork, Jie Tang, and Leila Zia, editors,WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021, pages 3477–...
-
[13]
Don’t generate, discriminate: A proposal for grounding language models to real-world environments
Yu Gu, Xiang Deng, and Yu Su. Don't generate, discriminate: A proposal for grounding language models to real-world environments. CoRR, abs/2212.09736, 2022. doi: 10.48550/arXiv.2212.09736. URL https://doi.org/10.48550/arXiv.2212.09736
-
[14]
Understanding html with large language models, 2023
Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding html with large language models, 2023
work page 2023
-
[15]
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings
Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. CoRR, abs/2305.11554, 2023. doi: 10.48550/arXiv.2305.11554. URL https://doi.org/10.48550/arXiv.2305.11554
-
[16]
Deberta: decoding-enhanced bert with disentangled attention
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: decoding-enhanced bert with disentangled attention. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=XPZIaotutsD
work page 2021
-
[17]
StructGPT: A General Framework for Large Language Model to Reason over Structured Data
Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. StructGPT: A general framework for large language model to reason over structured data. CoRR, abs/2305.09645, 2023. doi: 10.48550/arXiv.2305.09645. URL https://doi.org/10.48550/arXiv.2305.09645
-
[18]
Dense passage retrieval for open-domain question answering
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, November 2020. Association for Computational Lin- guistics....
-
[19]
CoScripter: Automating & Sharing How-to Knowledge in the Enterprise
Gilly Leshed, Eben M. Haber, Tara Matthews, and Tessa Lau. CoScripter: Automating & sharing how-to knowledge in the enterprise. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1719–1728, Florence, Italy, April 2008. ACM. ISBN 978-1-60558-011-1. doi: 10.1145/1357054.1357323.
-
[20]
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A benchmark for tool-augmented llms. CoRR, abs/2304.08244, 2023. doi: 10.48550/arXiv.2304.08244. URL https://doi.org/10.48550/arXiv.2304.08244
work page internal anchor Pith review doi:10.48550/arxiv.2304.08244 2023
-
[21]
Mapping natural language instructions to mobile UI action sequences
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile UI action sequences. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8198–8210. Asso...
-
[22]
Reinforcement learning on web interfaces using workflow-guided exploration
Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum...
work page 2018
-
[23]
Augmented Language Models: a Survey
Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. Augmented language models: a survey. CoRR, abs/2302.07842, 2023. doi: 10.48550/arXiv.2302.07842. URL https://doi.org/10.48550/arXiv.2302.07842
work page internal anchor Pith review doi:10.48550/arxiv.2302.07842 2023
-
[24]
OpenAI. Chatgpt plugins. https://openai.com/blog/chatgpt-plugins. 2023
work page 2023
- [25]
-
[26]
Gorilla: Large Language Model Connected with Massive APIs
Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. CoRR, abs/2305.15334, 2023. doi: 10.48550/arXiv.2305.15334. URL https://doi.org/10.48550/arXiv.2305.15334
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305 2023
-
[27]
Tool learning with foundation models
Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yux...
-
[28]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL https://arxiv.org/abs/1908.10084
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[29]
Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. CoRR, abs/2302.04761, 2023. doi: 10.48550/arXiv.2302.04761. URL https://doi.org/10.48550/arXiv.2302.04761
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.04761 2023
-
[30]–[31]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. CoRR, abs/2303.17580, 2023. doi: 10.48550/arXiv.2303.17580. URL https://doi.org/10.48550/arXiv.2303.17580
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.17580
-
[32]
World of Bits: An Open-Domain Platform for Web-Based Agents
Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of Bits: An Open-Domain Platform for Web-Based Agents. In Proceedings of the 34th International Conference on Machine Learning, pages 3135–3144. PMLR, July 2017
work page 2017
-
[33]
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 10737–10746. Computer Vis...
-
[34]
LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models
Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. CoRR, abs/2212.04088, 2022. doi: 10.48550/arXiv.2212.04088. URL https://doi.org/10.48550/arXiv.2212.04088
-
[35]
Building Natural Language Interfaces to Web APIs
Yu Su, Ahmed Hassan Awadallah, Madian Khabsa, Patrick Pantel, Michael Gamon, and Mark J. Encarnación. Building natural language interfaces to web apis. In Ee-Peng Lim, Marianne Winslett, Mark Sanderson, Ada Wai-Chee Fu, Jimeng Sun, J. Shane Culpepper, Eric Lo, Joyce C. Ho, Debora Donato, Rakesh Agrawal, Yu Zheng, Carlos Castillo, Aixin Sun, Vincent S. Tse...
-
[36]
META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI, November 2022
Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, and Kai Yu. META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI, November 2022
work page 2022
-
[37]
RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers
Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567–7578, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.677
-
[38]
Automatic task completion flows from web apis
Kyle Williams, Seyyed Hadi Hashemi, and Imed Zitouni. Automatic task completion flows from web apis. In Benjamin Piwowarski, Max Chevalier, Éric Gaussier, Yoelle Maarek, Jian-Yun Nie, and Falk Scholer, editors, Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 2...
-
[39]
Transformers: State-of-the-Art Natural Language Processing
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-ar...
work page 2020
-
[40]
Grounding Open-Domain Instructions to Automate Web Support Tasks
Nancy Xu, Sam Masling, Michael Du, Giovanni Campagna, Larry Heck, James A. Landay, and Monica S. Lam. Grounding open-domain instructions to automate web support tasks. In North American Chapter of the Association for Computational Linguistics, 2021
work page 2021
-
[42]
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. CoRR, abs/2210.03629, 2022. doi: 10.48550/arXiv.2210.03629. URL https://doi.org/10.48550/arXiv.2210.03629
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.03629 2022
-
[43]
Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base
Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1321–1331...
-
[44]–[45]
Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedi... doi: 10.18653/v1/d18-1425. URL https://doi.org/10.18653/v1/d18-1425
work page 2018
-
[47]
A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu, Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu, and Lichao Sun. A comprehensive survey on pretrained foundation models: A history from BERT to chatgpt. CoRR, abs/2302.09419, 2023. doi: 10.48550/arXiv.2302.09...
work page internal anchor Pith review doi:10.48550/arxiv 2023