hub Canonical reference

Language Models can Solve Computer Tasks

Geunwoo Kim, Pierre Baldi, Stephen McAleer · 2023 · cs.CL · arXiv 2303.17491

Canonical reference. 80% of citing Pith papers cite this work as background.

22 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 22 citing papers arXiv PDF

abstract

Agents capable of carrying out general tasks on a computer can improve efficiency and productivity by automating repetitive tasks and assisting in complex problem-solving. Ideally, such agents should be able to solve new computer tasks presented to them through natural language commands. However, previous approaches to this problem require large amounts of expert demonstrations and task-specific reward functions, both of which are impractical for new tasks. In this work, we show that a pre-trained large language model (LLM) agent can execute computer tasks guided by natural language using a simple prompting scheme where the agent Recursively Criticizes and Improves its output (RCI). The RCI approach significantly outperforms existing LLM methods for automating computer tasks and surpasses supervised learning (SL) and reinforcement learning (RL) approaches on the MiniWoB++ benchmark. We compare multiple LLMs and find that RCI with the InstructGPT-3+RLHF LLM is state-of-the-art on MiniWoB++, using only a handful of demonstrations per task rather than tens of thousands, and without a task-specific reward function. Furthermore, we demonstrate RCI prompting's effectiveness in enhancing LLMs' reasoning abilities on a suite of natural language reasoning tasks, outperforming chain of thought (CoT) prompting with external feedback. We find that RCI combined with CoT performs better than either separately. Our code can be found here: https://github.com/posgnu/rci-agent.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 other 1

citation-polarity summary

background 4 unclear 1

representative citing papers

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

cs.LG · 2024-03-12 · unverdicted · novelty 7.0

WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

Large Language Models as Optimizers

cs.LG · 2023-09-07 · unverdicted · novelty 7.0

Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.

Reflexion: Language Agents with Verbal Reinforcement Learning

cs.AI · 2023-03-20 · conditional · novelty 7.0

Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.

Signal-Driven Observation for Long-Horizon Web Agents

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Signal-Driven Observation decouples observation from action frequency in long-horizon web agents by invoking selective task-relevant DOM reads only on signals such as URL changes or action failures.

OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

cs.CL · 2026-04-23 · conditional · novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

cs.HC · 2024-01-17 · unverdicted · novelty 6.0

SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

GPT-4V(ision) is a Generalist Web Agent, if Grounded

cs.IR · 2024-01-03 · conditional · novelty 6.0

GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.

Cognitive Architectures for Language Agents

cs.AI · 2023-09-05 · accept · novelty 6.0

CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.

Gorilla: Large Language Model Connected with Massive APIs

cs.CL · 2023-05-24 · conditional · novelty 6.0

Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.

ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models

cs.CL · 2023-05-23 · conditional · novelty 6.0

ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.

Teaching Large Language Models to Self-Debug

cs.CL · 2023-04-11 · unverdicted · novelty 6.0

Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.

Compile-time Security Analysis and Optimization of Sensitive String Producers

cs.PL · 2026-05-15 · unverdicted · novelty 5.0

A language-integrated framework for compile-time analysis of sensitive string producers that minimizes lexical distance between secure and insecure idioms.

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

cs.CL · 2025-03-12 · unverdicted · novelty 5.0

Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.

Large Language Model-Brained GUI Agents: A Survey

cs.AI · 2024-11-27 · unverdicted · novelty 4.0

A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

Understanding the planning of LLM agents: A survey

cs.AI · 2024-02-05 · accept · novelty 4.0

A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

cs.CV · 2023-09-29 · conditional · novelty 4.0

GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

citing papers explorer

Showing 22 of 22 citing papers.

Instruction Tuning with GPT-4 cs.CL · 2023-04-06 · unverdicted · none · ref 7 · internal anchor
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? cs.LG · 2024-03-12 · unverdicted · none · ref 11 · internal anchor
WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads cs.LG · 2024-01-19 · conditional · none · ref 88 · internal anchor
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Large Language Models as Optimizers cs.LG · 2023-09-07 · unverdicted · none · ref 12 · internal anchor
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.
Reflexion: Language Agents with Verbal Reinforcement Learning cs.AI · 2023-03-20 · conditional · none · ref 10 · internal anchor
Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks cs.AI · 2026-06-28 · unverdicted · none · ref 36 · internal anchor
OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.
Signal-Driven Observation for Long-Horizon Web Agents cs.CL · 2026-06-04 · unverdicted · none · ref 38 · internal anchor
Signal-Driven Observation decouples observation from action frequency in long-horizon web agents by invoking selective task-relevant DOM reads only on signals such as URL changes or action failures.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces cs.AI · 2026-05-09 · unverdicted · none · ref 56 · internal anchor
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation cs.CL · 2026-04-23 · conditional · none · ref 36 · internal anchor
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
Training Language Models to Self-Correct via Reinforcement Learning cs.LG · 2024-09-19 · unverdicted · none · ref 14 · internal anchor
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 146 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents cs.HC · 2024-01-17 · unverdicted · none · ref 81 · internal anchor
SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
GPT-4V(ision) is a Generalist Web Agent, if Grounded cs.IR · 2024-01-03 · conditional · none · ref 13 · internal anchor
GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.
Cognitive Architectures for Language Agents cs.AI · 2023-09-05 · accept · none · ref 38 · internal anchor
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
Gorilla: Large Language Model Connected with Massive APIs cs.CL · 2023-05-24 · conditional · none · ref 17 · internal anchor
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models cs.CL · 2023-05-23 · conditional · none · ref 6 · internal anchor
ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.
Teaching Large Language Models to Self-Debug cs.CL · 2023-04-11 · unverdicted · none · ref 101 · internal anchor
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
Compile-time Security Analysis and Optimization of Sensitive String Producers cs.PL · 2026-05-15 · unverdicted · none · ref 11 · internal anchor
A language-integrated framework for compile-time analysis of sensitive string producers that minimizes lexical distance between secure and insecure idioms.
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks cs.CL · 2025-03-12 · unverdicted · none · ref 18 · internal anchor
Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.
Large Language Model-Brained GUI Agents: A Survey cs.AI · 2024-11-27 · unverdicted · none · ref 105 · internal anchor
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
Understanding the planning of LLM agents: A survey cs.AI · 2024-02-05 · accept · none · ref 20 · internal anchor
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) cs.CV · 2023-09-29 · conditional · none · ref 63 · internal anchor
GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

Language Models can Solve Computer Tasks

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer