pith. machine review for the scientific record.

arxiv: 2410.23218 · v1 · submitted 2024-10-30 · 💻 cs.CL · cs.CV · cs.HC


OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Chengyou Jia, Fangzhi Xu, Kanzhi Cheng, Liheng Chen, Paul Pu Liang, Qiushi Sun, Yian Wang, Yu Qiao, Zhenyu Wu, Zhiyong Wu, Zichen Ding

Pith reviewed 2026-05-13 09:25 UTC · model grok-4.3

classification 💻 cs.CL · cs.CV · cs.HC
keywords GUI agents · GUI grounding · vision-language models · open-source dataset · action model · cross-platform · out-of-distribution generalization

The pith

A foundation model for GUI agents trained on a synthesized cross-platform dataset of over 13 million elements achieves strong performance on grounding and out-of-distribution tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an open-source toolkit to create large-scale GUI grounding data from multiple operating systems and platforms. This data is used to train OS-Atlas, a model that understands GUI screenshots and performs actions. Evaluations on six benchmarks show it improves over previous open-source models, especially in handling new interfaces. This approach addresses the reliance on commercial models by providing a scalable way to build capable open GUI agents. If successful, it opens the door for more accessible and customizable GUI automation tools.

Core claim

OS-Atlas demonstrates that a large cross-platform GUI grounding corpus, generated through an open-source synthesis toolkit, combined with targeted model training, enables open-source vision-language models to achieve significant gains in GUI grounding accuracy and generalization to out-of-distribution scenarios across mobile, desktop, and web platforms.

What carries the argument

The OS-Atlas model itself, built on two innovations: cross-platform data synthesis via an open toolkit, and model training tailored to GUI action prediction.
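To make the data-synthesis claim concrete, the sketch below shows one way such a toolkit could emit grounding records on the web platform: walk the page's element tree and pair each visible, named interactable with its bounding box on a screenshot. This is an editorial illustration assuming Playwright; the selectors and record schema are our choices, not the paper's released pipeline.

```python
# Hypothetical sketch of grounding-data synthesis (web platform only).
# Assumes Playwright (pip install playwright; playwright install chromium).
# The record schema is our illustration, not the paper's released format.
from playwright.sync_api import sync_playwright

def synthesize_grounding_records(url: str, out_png: str) -> list[dict]:
    records = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url)
        page.screenshot(path=out_png)
        # Interactable elements that carry a human-readable name.
        for el in page.query_selector_all("a, button, input, [role=button]"):
            box = el.bounding_box()  # {"x", "y", "width", "height"} or None if hidden
            name = el.get_attribute("aria-label") or el.inner_text().strip()
            if box and name:
                records.append({
                    "screenshot": out_png,
                    "instruction": name,  # referring expression for the element
                    "bbox": [box["x"], box["y"],
                             box["x"] + box["width"], box["y"] + box["height"]],
                })
        browser.close()
    return records

if __name__ == "__main__":
    for r in synthesize_grounding_records("https://example.com", "shot.png")[:5]:
        print(r)
```

Each record is a (screenshot, referring expression, bounding box) triple; scaled across five platforms, this is the shape of corpus the paper reports at over 13 million elements.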

If this is right

  • Open-source GUI agents can now compete with closed-source ones without relying on commercial APIs.
  • Scaling the dataset further could lead to even better performance on complex agentic tasks.
  • The synthesis method allows for continuous improvement by generating more diverse data.
  • It provides insights into what makes GUI understanding work in open models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This could reduce costs for developing GUI automation by avoiding paid model services.
  • The toolkit might be adapted for other visual interaction domains like robotics.
  • Future work could test if the same data helps in non-GUI visual tasks.

Load-bearing premise

The synthesized GUI data from the toolkit closely matches real-world interface interactions and supports learning that transfers to new, unseen applications.

What would settle it

Evaluating the model on a set of completely new GUI applications not covered in the synthesis process and finding no improvement in grounding accuracy compared to baselines would falsify the claim.
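For reference, the grounding metric implied here is typically point-in-box: a prediction counts as correct when the predicted click coordinate lands inside the gold element's bounding box. A minimal sketch of that scorer, with the record format assumed rather than taken from the paper:

```python
# Minimal grounding-accuracy scorer: a predicted click is correct iff it lands
# inside the gold element's bounding box. The record format is our assumption.
def point_in_bbox(point, bbox):
    x, y = point
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

def grounding_accuracy(predictions, gold_boxes):
    """predictions: list of (x, y); gold_boxes: list of (x0, y0, x1, y1)."""
    hits = sum(point_in_bbox(p, b) for p, b in zip(predictions, gold_boxes))
    return hits / len(gold_boxes)

assert grounding_accuracy([(50, 40)], [(10, 10, 100, 60)]) == 1.0
assert grounding_accuracy([(5, 5)], [(10, 10, 100, 60)]) == 0.0
```

Running this scorer on applications outside the synthesis coverage, against a baseline, is the falsification test described above.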

Original abstract

Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas - a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces OS-Atlas, a foundation action model for GUI agents. It describes an open-source toolkit for synthesizing cross-platform GUI grounding data (Windows, Linux, macOS, Android, web) that yields a released corpus of over 13 million elements, combined with modeling innovations to improve GUI screenshot understanding and generalization. The central claim is that this yields significant performance gains over prior SOTA on six benchmarks spanning mobile, desktop, and web platforms for both grounding and OOD agentic tasks.

Significance. If the reported gains prove robust, the work would supply a valuable open-source resource and baseline for GUI agent research, lowering dependence on closed VLMs and enabling further scaling of agentic capabilities. The public release of the 13M-element corpus is a concrete strength that could support reproducible follow-on work.

major comments (2)
  1. [Data synthesis] § on data synthesis (toolkit and corpus construction): the headline generalization and OOD claims rest on the assumption that the 13M synthetic elements faithfully capture real GUI variability, noise, and state changes. No distributional comparison (pixel statistics, layout entropy, element-type frequencies, or failure-mode overlap) with held-out real traces is supplied, leaving open the possibility that benchmark gains are artifacts of the synthesis process rather than evidence of true robustness. (A sketch of one such comparison follows the minor comments below.)
  2. [Evaluation] Evaluation section: the abstract asserts 'significant performance improvements' across six benchmarks, yet the manuscript supplies neither the concrete metrics (e.g., accuracy deltas, per-platform scores), error breakdowns, nor ablations isolating the contribution of data scale versus modeling changes. Without these, the load-bearing causal link between the new corpus and the claimed advances cannot be verified.
minor comments (1)
  1. [Abstract] Abstract: replace the qualitative phrase 'significant performance improvements' with at least one concrete metric or table reference so readers can immediately gauge the magnitude of the gains.
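For the distributional comparison requested in major comment 1, one lightweight instantiation is to compare element-type frequencies between the synthetic corpus and held-out real traces via Jensen-Shannon distance. A sketch under the assumption that both corpora expose a per-element `type` label (our assumption, not the paper's schema):

```python
# Sketch of the element-type frequency comparison suggested above.
# Uses scipy's Jensen-Shannon distance; element "type" labels are assumed.
from collections import Counter
from scipy.spatial.distance import jensenshannon

def type_distribution(elements, vocab):
    counts = Counter(e["type"] for e in elements)
    total = sum(counts.values()) or 1
    return [counts.get(t, 0) / total for t in vocab]

def synthetic_vs_real_divergence(synthetic, real):
    vocab = sorted({e["type"] for e in synthetic} | {e["type"] for e in real})
    p = type_distribution(synthetic, vocab)
    q = type_distribution(real, vocab)
    return jensenshannon(p, q, base=2)  # 0 = identical, 1 = disjoint

synth = [{"type": "button"}] * 70 + [{"type": "link"}] * 30
real = [{"type": "button"}] * 50 + [{"type": "link"}] * 40 + [{"type": "slider"}] * 10
print(f"JS distance: {synthetic_vs_real_divergence(synth, real):.3f}")
```

Pixel statistics and layout entropy would need analogous one-dimensional summaries, but even this frequency check would bound how far the synthetic distribution drifts from real interfaces.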

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on OS-Atlas. We address each major comment point-by-point below, indicating planned revisions to improve clarity and support for our claims.

Point-by-point responses
  1. Referee: [Data synthesis] § on data synthesis (toolkit and corpus construction): the headline generalization and OOD claims rest on the assumption that the 13M synthetic elements faithfully capture real GUI variability, noise, and state changes. No distributional comparison (pixel statistics, layout entropy, element-type frequencies, or failure-mode overlap) with held-out real traces is supplied, leaving open the possibility that benchmark gains are artifacts of the synthesis process rather than evidence of true robustness.

    Authors: We acknowledge that explicit distributional comparisons would strengthen the robustness argument. The synthesis toolkit generates elements by programmatically traversing real GUI hierarchies and injecting controlled variations in layout, occlusion, and state transitions across the five platforms. While this process is intended to approximate real variability, the current manuscript does not include side-by-side statistics (e.g., pixel histograms or element-type frequencies) against held-out human traces. We will add these analyses in the revision, using a small set of real interaction logs we have collected, to quantify fidelity and address the concern directly. revision: yes

  2. Referee: [Evaluation] Evaluation section: the abstract asserts 'significant performance improvements' across six benchmarks, yet the manuscript supplies neither the concrete metrics (e.g., accuracy deltas, per-platform scores), error breakdowns, nor ablations isolating the contribution of data scale versus modeling changes. Without these, the load-bearing causal link between the new corpus and the claimed advances cannot be verified.

    Authors: The full manuscript contains tables reporting per-benchmark accuracies and deltas versus prior open-source models, with separate columns for mobile, desktop, and web platforms. However, we agree that error breakdowns (e.g., by grounding failure type) and ablations separating data-scale effects from modeling changes are missing. We will expand the evaluation section with these elements, including an ablation on corpus size (1M vs. 13M) and per-error-type analysis, to make the contributions transparent. revision: yes
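As one concrete shape for the per-error-type analysis promised above, grounding failures can be bucketed by how the prediction misses the gold box: a near miss just outside the boundary versus a far miss elsewhere on the screen. The categories and the 20-pixel threshold below are illustrative editorial choices, not the authors':

```python
# Hypothetical grounding-failure taxonomy for the promised error breakdown.
# The categories and 20-pixel "near miss" threshold are illustrative choices.
from collections import Counter

def classify_failure(point, bbox, near_px=20):
    x, y = point
    x0, y0, x1, y1 = bbox
    if x0 <= x <= x1 and y0 <= y <= y1:
        return "hit"
    # Euclidean distance from the point to the nearest edge of the box.
    dx = max(x0 - x, 0, x - x1)
    dy = max(y0 - y, 0, y - y1)
    return "near_miss" if (dx**2 + dy**2) ** 0.5 <= near_px else "far_miss"

def error_breakdown(predictions, gold_boxes):
    return Counter(classify_failure(p, b) for p, b in zip(predictions, gold_boxes))

print(error_breakdown([(55, 45), (105, 60), (300, 300)],
                      [(10, 10, 100, 60)] * 3))
# Counter({'hit': 1, 'near_miss': 1, 'far_miss': 1})
```

A breakdown along these lines would show whether failures are calibration errors (near misses) or genuine grounding errors (far misses), which bears directly on the data-fidelity question in major comment 1.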

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper's claims rest on synthesizing a new 13M-element GUI grounding corpus via an open-source toolkit and measuring performance gains on six external benchmarks spanning mobile/desktop/web platforms. No equations, fitted parameters, or self-referential definitions appear in the provided text; the central result (improved grounding and OOD behavior) is evaluated against prior SOTA models on independent test sets rather than reducing to the synthesis process by construction. Model training innovations are presented as additive contributions without load-bearing self-citations that collapse the argument.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; training presumably follows standard VLM fine-tuning practice, whose details are not stated.

pith-pipeline@v0.9.0 · 5565 in / 1114 out tokens · 81172 ms · 2026-05-13T09:25:11.942191+00:00 · methodology


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

  2. ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

    cs.CL 2026-05 unverdicted novelty 7.0

    ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...

  3. Don't Click That: Teaching Web Agents to Resist Deceptive Interfaces

    cs.AI 2026-05 unverdicted novelty 7.0

    DUDE framework reduces web agents' susceptibility to deceptive UIs by 53.8% on a new 1,407-scenario benchmark while preserving task performance.

  4. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  5. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

    cs.AI 2026-05 accept novelty 7.0

    GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.

  6. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

    cs.AI 2026-05 unverdicted novelty 7.0

    GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.

  7. GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

    cs.LG 2026-04 conditional novelty 7.0

    GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.

  8. MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    cs.CV 2026-04 unverdicted novelty 7.0

    Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

  9. ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

    cs.CL 2026-05 unverdicted novelty 6.0

    ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...

  10. LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.

  11. BAMI: Training-Free Bias Mitigation in GUI Grounding

    cs.CV 2026-05 unverdicted novelty 6.0

    BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.

  12. AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding

    cs.CV 2026-05 unverdicted novelty 6.0

    AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.

  13. VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

    cs.CL 2026-04 conditional novelty 6.0

    VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

  14. Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines

    cs.CV 2026-04 unverdicted novelty 6.0

    Zoom consistency provides a geometric, cross-model confidence signal in zoom-in grounding pipelines that correlates with prediction correctness and enables modest gains in specialist-generalist routing.

  15. UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    UI-Copilot adds a selective copilot for memory and math to GUI agents and trains tool use with separate single-turn and multi-turn optimization, yielding SOTA results on MemGUI-Bench and a 17.1% gain on AndroidWorld.

  16. What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    UI-in-the-Loop makes multimodal models explicitly learn UI element locations, meanings, and uses in a cyclic screen-element-action loop, delivering better UI comprehension and GUI reasoning on a new 26K-sample benchmark.

  17. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  18. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  19. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  20. Towards Scalable Lightweight GUI Agents via Multi-role Orchestration

    cs.AI 2026-04 unverdicted novelty 5.0

    LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi...

  21. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  22. ZAYA1-VL-8B Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...

  23. Learning Selective LLM Autonomy from Copilot Feedback in Enterprise Customer Support Workflows

    cs.CL 2026-04 unverdicted novelty 4.0

    A deployed BPM system uses copilot feedback to learn when to automate UI actions, achieving 45% session automation and 39% reduced handling time without quality loss.

  24. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  25. X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

    cs.CV 2026-05 unverdicted novelty 3.0

    X-OmniClaw presents a unified architecture for Android mobile agents using Omni Perception, Memory, and Action modules to enable efficient multimodal task handling and personalized interactions.
