OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Pith reviewed 2026-05-13 09:25 UTC · model grok-4.3
The pith
A foundation model for GUI agents trained on a synthesized cross-platform dataset of over 13 million elements achieves strong performance on grounding and out-of-distribution tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OS-Atlas demonstrates that a large cross-platform GUI grounding corpus, generated with an open-source synthesis toolkit and combined with targeted model training, enables open-source vision-language models to achieve significant gains in GUI grounding accuracy and in generalization to out-of-distribution scenarios across mobile, desktop, and web platforms.
What carries the argument
The OS-Atlas model, built on two innovations: data synthesis via an open-source toolkit, and model training tailored to GUI action prediction.
If this is right
- Open-source GUI agents can now compete with closed-source ones without relying on commercial APIs.
- Scaling the dataset further could lead to even better performance on complex agentic tasks.
- The synthesis method allows for continuous improvement by generating more diverse data.
- It provides insights into what makes GUI understanding work in open models.
Where Pith is reading between the lines
- This could reduce costs for developing GUI automation by avoiding paid model services.
- The toolkit might be adapted for other visual interaction domains like robotics.
- Future work could test if the same data helps in non-GUI visual tasks.
Load-bearing premise
The synthesized GUI data from the toolkit closely matches real-world interface interactions and supports learning that transfers to new, unseen applications.
What would settle it
Evaluating the model on a set of completely new GUI applications not covered in the synthesis process and finding no improvement in grounding accuracy compared to baselines would falsify the claim.
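The falsification test above can be sketched as a paired bootstrap on per-example grounding correctness for unseen applications; the function name, data format, and toy accuracies below are illustrative assumptions, not anything from the paper. If the resulting confidence interval on the accuracy gap contains zero, the claimed improvement over the baseline is not supported.

```python
# Sketch of the falsification test: paired bootstrap CI on the grounding-
# accuracy gap (model minus baseline) over held-out, unseen GUI apps.
# Inputs are 0/1 correctness per example; all names here are hypothetical.
import random

def bootstrap_delta_ci(model_correct, base_correct, n_boot=2000, seed=0):
    """95% paired-bootstrap confidence interval on accuracy(model) - accuracy(baseline)."""
    rng = random.Random(seed)
    n = len(model_correct)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample examples, keeping pairs aligned
        dm = sum(model_correct[i] for i in idx) / n
        db = sum(base_correct[i] for i in idx) / n
        deltas.append(dm - db)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]

# Invented per-example correctness on a set of unseen apps.
model = [1] * 70 + [0] * 30   # hypothetical 70% grounding accuracy
base = [1] * 55 + [0] * 45    # hypothetical 55% baseline accuracy
lo, hi = bootstrap_delta_ci(model, base)
print(f"95% CI on accuracy gap: [{lo:+.3f}, {hi:+.3f}]")
```

The pairing matters: resampling the same example indices for both systems controls for example difficulty, so the interval reflects the gap rather than per-benchmark variance.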
Original abstract
Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas - a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OS-Atlas, a foundation action model for GUI agents. It describes an open-source toolkit for synthesizing cross-platform GUI grounding data (Windows, Linux, macOS, Android, web) that yields a released corpus of over 13 million elements, combined with modeling innovations to improve GUI screenshot understanding and generalization. The central claim is that this yields significant performance gains over prior SOTA on six benchmarks spanning mobile, desktop, and web platforms for both grounding and OOD agentic tasks.
Significance. If the reported gains prove robust, the work would supply a valuable open-source resource and baseline for GUI agent research, lowering dependence on closed VLMs and enabling further scaling of agentic capabilities. The public release of the 13M-element corpus is a concrete strength that could support reproducible follow-on work.
Major comments (2)
- [Data synthesis] Data synthesis section (toolkit and corpus construction): the headline generalization and OOD claims rest on the assumption that the 13M synthetic elements faithfully capture real GUI variability, noise, and state changes. No distributional comparison (pixel statistics, layout entropy, element-type frequencies, or failure-mode overlap) with held-out real traces is supplied, leaving open the possibility that benchmark gains are artifacts of the synthesis process rather than evidence of true robustness.
- [Evaluation] Evaluation section: the abstract asserts 'significant performance improvements' across six benchmarks, yet the manuscript supplies neither the concrete metrics (e.g., accuracy deltas, per-platform scores), error breakdowns, nor ablations isolating the contribution of data scale versus modeling changes. Without these, the load-bearing causal link between the new corpus and the claimed advances cannot be verified.
Minor comments (1)
- [Abstract] Abstract: replace the qualitative phrase 'significant performance improvements' with at least one concrete metric or table reference so readers can immediately gauge the magnitude of the gains.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on OS-Atlas. We address each major comment point-by-point below, indicating planned revisions to improve clarity and support for our claims.
Point-by-point responses
Referee: [Data synthesis] Data synthesis section (toolkit and corpus construction): the headline generalization and OOD claims rest on the assumption that the 13M synthetic elements faithfully capture real GUI variability, noise, and state changes. No distributional comparison (pixel statistics, layout entropy, element-type frequencies, or failure-mode overlap) with held-out real traces is supplied, leaving open the possibility that benchmark gains are artifacts of the synthesis process rather than evidence of true robustness.
Authors: We acknowledge that explicit distributional comparisons would strengthen the robustness argument. The synthesis toolkit generates elements by programmatically traversing real GUI hierarchies and injecting controlled variations in layout, occlusion, and state transitions across the five platforms. While this process is intended to approximate real variability, the current manuscript does not include side-by-side statistics (e.g., pixel histograms or element-type frequencies) against held-out human traces. We will add these analyses in the revision, using a small set of real interaction logs we have collected, to quantify fidelity and address the concern directly. Revision: yes.
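One of the fidelity checks proposed here, comparing element-type frequency distributions between the synthetic corpus and held-out real traces, can be sketched with a Jensen-Shannon divergence. The record format and the toy corpora below are assumptions for illustration, not the toolkit's actual schema.

```python
# Illustrative fidelity check (not from the paper): compare element-type
# frequency distributions between a synthetic corpus and real traces.
# The record format ({"type": ...}) and the toy corpora are assumptions.
from collections import Counter
from math import log2

def type_distribution(elements, vocab):
    """Relative frequency of each element type over a fixed vocabulary."""
    counts = Counter(e["type"] for e in elements)
    total = sum(counts.values())
    return [counts.get(t, 0) / total for t in vocab]

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * log2((ai + eps) / (bi + eps)) for ai, bi in zip(a, b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy corpora standing in for the 13M synthetic elements and real logs.
synthetic = [{"type": "button"}] * 700 + [{"type": "text"}] * 300
real = [{"type": "button"}] * 500 + [{"type": "text"}] * 400 + [{"type": "icon"}] * 100

vocab = sorted({e["type"] for e in synthetic + real})
p, q = type_distribution(synthetic, vocab), type_distribution(real, vocab)
print(f"element-type JS divergence: {js_divergence(p, q):.3f}")
```

The same divergence could be reported per platform, alongside the pixel-histogram and layout-entropy comparisons the rebuttal promises.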
Referee: [Evaluation] Evaluation section: the abstract asserts 'significant performance improvements' across six benchmarks, yet the manuscript supplies neither the concrete metrics (e.g., accuracy deltas, per-platform scores), error breakdowns, nor ablations isolating the contribution of data scale versus modeling changes. Without these, the load-bearing causal link between the new corpus and the claimed advances cannot be verified.
Authors: The full manuscript contains tables reporting per-benchmark accuracies and deltas versus prior open-source models, with separate columns for mobile, desktop, and web platforms. However, we agree that error breakdowns (e.g., by grounding failure type) and ablations separating data-scale effects from modeling changes are missing. We will expand the evaluation section with these elements, including an ablation on corpus size (1M vs. 13M) and per-error-type analysis, to make the contributions transparent. Revision: yes.
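The promised corpus-size ablation could be reported in a form like the following; every accuracy number here is an invented placeholder, not a result from the paper.

```python
# Hypothetical shape of the promised corpus-size ablation: per-platform
# grounding accuracy at 1M vs. 13M training elements, with deltas.
# All numbers are invented placeholders, not results from the paper.
ablation = {
    "mobile":  {"1M": 0.62, "13M": 0.71},
    "desktop": {"1M": 0.55, "13M": 0.66},
    "web":     {"1M": 0.58, "13M": 0.69},
}

for platform, acc in ablation.items():
    delta = acc["13M"] - acc["1M"]
    print(f"{platform:<8} 1M={acc['1M']:.2f}  13M={acc['13M']:.2f}  delta={delta:+.2f}")
```

Reporting the delta per platform, rather than a single pooled number, is what would let readers separate data-scale effects from modeling changes.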
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper's claims rest on synthesizing a new 13M-element GUI grounding corpus via an open-source toolkit and measuring performance gains on six external benchmarks spanning mobile/desktop/web platforms. No equations, fitted parameters, or self-referential definitions appear in the provided text; the central result (improved grounding and OOD behavior) is evaluated against prior SOTA models on independent test sets rather than reducing to the synthesis process by construction. Model training innovations are presented as additive contributions without load-bearing self-citations that collapse the argument.
Forward citations
Cited by 25 Pith papers
- Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
  Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
  ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
- Don't Click That: Teaching Web Agents to Resist Deceptive Interfaces
  DUDE framework reduces web agents' susceptibility to deceptive UIs by 53.8% on a new 1,407-scenario benchmark while preserving task performance.
- Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
  Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
- Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
  GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
- Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
  GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
  GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
  Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
  ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
- LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
  LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
- BAMI: Training-Free Bias Mitigation in GUI Grounding
  BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.
- AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
  AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
- VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
  VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
- Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines
  Zoom consistency provides a geometric, cross-model confidence signal in zoom-in grounding pipelines that correlates with prediction correctness and enables modest gains in specialist-generalist routing.
- UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization
  UI-Copilot adds a selective copilot for memory and math to GUI agents and trains tool use with separate single-turn and multi-turn optimization, yielding SOTA results on MemGUI-Bench and a 17.1% gain on AndroidWorld.
- What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning
  UI-in-the-Loop makes multimodal models explicitly learn UI element locations, meanings, and uses in a cyclic screen-element-action loop, delivering better UI comprehension and GUI reasoning on a new 26K-sample benchmark.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
  A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
- Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
  LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi...
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
  UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
- ZAYA1-VL-8B Technical Report
  ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
- Learning Selective LLM Autonomy from Copilot Feedback in Enterprise Customer Support Workflows
  A deployed BPM system uses copilot feedback to learn when to automate UI actions, achieving 45% session automation and 39% reduced handling time without quality loss.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction
  X-OmniClaw presents a unified architecture for Android mobile agents using Omni Perception, Memory, and Action modules to enable efficient multimodal task handling and personalized interactions.