pith. machine review for the scientific record.

arxiv: 2303.17491 · v3 · pith:RIN4LFODnew · submitted 2023-03-30 · 💻 cs.CL · cs.AI · cs.HC · cs.LG

Language Models can Solve Computer Tasks

Pith reviewed 2026-05-17 12:11 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.HC · cs.LG
keywords language models · computer agents · prompting methods · MiniWoB++ · task automation · recursive self-critique · few-shot learning · reasoning enhancement

The pith

Pre-trained language models solve novel computer tasks by recursively criticizing and improving their own outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a pre-trained large language model can carry out general computer tasks described in natural language when guided by a simple prompting scheme in which the model repeatedly criticizes and refines its own proposed actions. This approach requires only a handful of demonstrations per task and no task-specific reward function, in contrast to earlier methods that depend on tens of thousands of expert examples or custom reinforcement signals. On the MiniWoB++ benchmark, the method reaches state-of-the-art performance when run with the InstructGPT-3+RLHF model, and it also improves results on a range of natural-language reasoning tasks, whether used alone or together with chain-of-thought prompting.

Core claim

A pre-trained LLM agent can execute computer tasks guided by natural language using a simple prompting scheme where the agent Recursively Criticizes and Improves its output (RCI). The RCI approach significantly outperforms existing LLM methods for automating computer tasks and surpasses supervised learning and reinforcement learning approaches on the MiniWoB++ benchmark, using only a handful of demonstrations per task rather than tens of thousands and without a task-specific reward function.

What carries the argument

The RCI prompting scheme, in which the model is instructed to critique its own previous output and then produce an improved version, applied recursively until a satisfactory action sequence is reached.
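
A minimal sketch of the critique-and-improve loop described above, in Python. The function name `rci`, the prompt wording, the `max_rounds` cap, and the "no problems" stop signal are all assumptions made for illustration; they are not the paper's templates, and `toy_llm` is a hard-coded stand-in for a real model call.

```python
from typing import Callable


def rci(task: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Draft an answer, then alternate critique and improvement steps.

    Prompt wording, max_rounds, and the stop signal are illustrative assumptions.
    """
    answer = llm(f"Task: {task}\nPropose a sequence of actions.")
    for _ in range(max_rounds):
        critique = llm(
            f"Task: {task}\nProposed answer: {answer}\n"
            "Review the proposed answer and find problems with it."
        )
        # Hypothetical stop convention: the critic reports nothing left to fix.
        if "no problems" in critique.lower():
            break
        answer = llm(
            f"Task: {task}\nProposed answer: {answer}\nCritique: {critique}\n"
            "Based on the critique, write an improved answer."
        )
    return answer


def toy_llm(prompt: str) -> str:
    """Hard-coded stand-in for a model call, used only to exercise the loop."""
    if "Based on the critique" in prompt:
        return "type 'hello' into the textbox, then click submit"
    if "find problems" in prompt:
        if "click submit" in prompt:
            return "No problems found."
        return "The plan never submits the form."
    return "type 'hello' into the textbox"


final_plan = rci("Enter hello and press the button.", toy_llm)
print(final_plan)
```

In this toy run the first draft omits the submit step, the critique flags it, and the second critique finds nothing further, so the loop stops early. With a real model, `llm` would wrap an API call, and the stop condition would likely need something more robust than a substring match.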

If this is right

  • New computer tasks can be automated without collecting large expert demonstration sets or designing per-task reward functions.
  • The same RCI procedure also raises accuracy on pure natural-language reasoning benchmarks when used by itself or combined with chain-of-thought prompting.
  • Performance improves when RCI is applied on top of chain-of-thought prompting rather than using either technique alone.
  • A single pre-trained model can be reused across many distinct web-based tasks after seeing only a few examples of each.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-critique loop could be extended to longer-horizon desktop tasks that involve multiple applications rather than single web pages.
  • If the underlying model is updated with more recent training data, the number of demonstrations needed per task might drop even further.
  • RCI could be combined with external verification tools, such as executing proposed actions in a sandbox, to catch errors the model itself does not notice.

Load-bearing premise

The pre-trained language model already possesses enough built-in knowledge about computer interfaces and the ability to critique its own reasoning so that a few demonstrations plus the RCI template suffice for it to generate correct actions on new tasks.

What would settle it

Finding that RCI prompting produces no measurable gain over ordinary few-shot prompting when the same InstructGPT-3 model is tested on a fresh set of MiniWoB++ tasks whose required mouse and keyboard sequences are absent from its training distribution.

Original abstract

Agents capable of carrying out general tasks on a computer can improve efficiency and productivity by automating repetitive tasks and assisting in complex problem-solving. Ideally, such agents should be able to solve new computer tasks presented to them through natural language commands. However, previous approaches to this problem require large amounts of expert demonstrations and task-specific reward functions, both of which are impractical for new tasks. In this work, we show that a pre-trained large language model (LLM) agent can execute computer tasks guided by natural language using a simple prompting scheme where the agent Recursively Criticizes and Improves its output (RCI). The RCI approach significantly outperforms existing LLM methods for automating computer tasks and surpasses supervised learning (SL) and reinforcement learning (RL) approaches on the MiniWoB++ benchmark. We compare multiple LLMs and find that RCI with the InstructGPT-3+RLHF LLM is state-of-the-art on MiniWoB++, using only a handful of demonstrations per task rather than tens of thousands, and without a task-specific reward function. Furthermore, we demonstrate RCI prompting's effectiveness in enhancing LLMs' reasoning abilities on a suite of natural language reasoning tasks, outperforming chain of thought (CoT) prompting with external feedback. We find that RCI combined with CoT performs better than either separately. Our code can be found here: https://github.com/posgnu/rci-agent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces Recursive Criticism and Improvement (RCI) prompting for pre-trained LLMs to solve computer tasks. It claims that RCI with InstructGPT-3+RLHF reaches state-of-the-art on the MiniWoB++ benchmark using only a handful of demonstrations per task and no task-specific reward function, outperforming prior LLM methods as well as supervised learning and reinforcement learning baselines. The work also reports that RCI improves LLM reasoning on natural language tasks and that combining RCI with chain-of-thought prompting yields further gains.

Significance. If the reported results hold, this is a significant empirical demonstration that general-purpose LLMs can automate a range of computer tasks with minimal task-specific data or engineering. The explicit multi-LLM comparisons, direct contrasts against SL/RL baselines, and public release of code at https://github.com/posgnu/rci-agent are clear strengths that support reproducibility and allow the community to verify and build on the findings.

minor comments (3)
  1. The abstract refers to 'a handful of demonstrations per task' without a precise count; the main experimental section should state the exact number of demonstrations used for each MiniWoB++ task.
  2. The prompting templates for RCI and the RCI+CoT variant are described at a high level; including the full template text or pseudocode in an appendix would improve reproducibility.
  3. Table captions and axis labels in the result figures would benefit from explicit mention of the evaluation metric (success rate) and the number of evaluation episodes per task.
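
To illustrate why the third comment asks for the episode count alongside the success rate, the sketch below computes a Wilson score interval for a success rate. The function name and the example counts are hypothetical and are not figures from the paper; the point is only that the same success rate is far less certain when it comes from fewer episodes.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, episodes: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 is roughly 95%)."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    p = successes / episodes
    denom = 1 + z * z / episodes
    center = (p + z * z / (2 * episodes)) / denom
    half = z * math.sqrt(p * (1 - p) / episodes + z * z / (4 * episodes ** 2)) / denom
    return center - half, center + half


# Hypothetical counts: the same 90% success rate from 50 and from 500 episodes.
small_lo, small_hi = wilson_interval(45, 50)
large_lo, large_hi = wilson_interval(450, 500)
print(f"50 episodes:  [{small_lo:.3f}, {small_hi:.3f}]")
print(f"500 episodes: [{large_lo:.3f}, {large_hi:.3f}]")
```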

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review, the recognition of our work's significance, and the recommendation to accept. We appreciate the note on the strengths of our multi-LLM comparisons, contrasts to SL/RL baselines, and public code release.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents a purely empirical prompting technique (RCI) evaluated on the MiniWoB++ benchmark against SL/RL baselines and other LLMs. No mathematical derivations, first-principles results, or equations are claimed; the central results consist of direct experimental comparisons using a handful of demonstrations and a fixed prompting template. All load-bearing claims are supported by reported benchmark gains rather than any self-referential definitions, fitted parameters renamed as predictions, or self-citation chains, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that current LLMs possess latent reasoning and self-correction abilities that prompting can reliably surface for sequential decision tasks.

axioms (1)
  • domain assumption Large language models possess general reasoning capabilities that can be elicited through prompting.
    The RCI method assumes the base model can generate useful self-critiques and improvements without task-specific fine-tuning.

pith-pipeline@v0.9.0 · 5555 in / 1262 out tokens · 36587 ms · 2026-05-17T12:11:24.124658+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Instruction Tuning with GPT-4

    cs.CL 2023-04 unverdicted novelty 8.0

    GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

  2. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    cs.LG 2024-03 unverdicted novelty 7.0

    WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.

  3. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  4. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  5. Reflexion: Language Agents with Verbal Reinforcement Learning

    cs.AI 2023-03 conditional novelty 7.0

    Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.

  6. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  7. VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

    cs.CL 2026-04 conditional novelty 6.0

    VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

  8. Training Language Models to Self-Correct via Reinforcement Learning

    cs.LG 2024-09 unverdicted novelty 6.0

    SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

  9. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  10. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    cs.HC 2024-01 unverdicted novelty 6.0

    SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

  11. GPT-4V(ision) is a Generalist Web Agent, if Grounded

    cs.IR 2024-01 conditional novelty 6.0

    GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.

  12. Cognitive Architectures for Language Agents

    cs.AI 2023-09 accept novelty 6.0

    CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic de...

  13. Gorilla: Large Language Model Connected with Massive APIs

    cs.CL 2023-05 conditional novelty 6.0

    Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.

  14. ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models

    cs.CL 2023-05 conditional novelty 6.0

    ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.

  15. Teaching Large Language Models to Self-Debug

    cs.CL 2023-04 unverdicted novelty 6.0

    Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.

  16. Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

    cs.CL 2025-03 unverdicted novelty 5.0

    Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.

  17. Understanding the planning of LLM agents: A survey

    cs.AI 2024-02 accept novelty 4.0

    A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.

  18. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    cs.CV 2023-09 conditional novelty 4.0

    GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · cited by 18 Pith papers · 19 internal anchors

  1. [1]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  3. [3]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  4. [4]

    Video pretraining (vpt): Learning to act by watching unlabeled online videos

    Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

  6. [6]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  7. [7]

    Grounding large language models in interactive environments with online reinforcement learning

    Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer. Grounding large language models in interactive environments with online reinforcement learning. arXiv preprint arXiv:2302.02662, 2023

  8. [8]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    Faithful reasoning using large language models

    Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022

  11. [11]

    Selection-inference: Exploiting large language models for interpretable logical reasoning

    Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712, 2022

  12. [12]

    Why can GPT learn in-context? Language models secretly perform gradient descent as meta optimizers

    Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? Language models secretly perform gradient descent as meta optimizers. arXiv preprint arXiv:2212.10559, 2022

  13. [13]

    Collaborating with language models for embodied reasoning

    Ishita Dasgupta, Christine Kaeser-Chen, Kenneth Marino, Arun Ahuja, Sheila Babayan, Felix Hill, and Rob Fergus. Collaborating with language models for embodied reasoning. In Second Workshop on Language and Reinforcement Learning, 2022

  14. [14]

    Language model cascades

    David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A Saurous, Jascha Sohl-Dickstein, et al. Language model cascades. arXiv preprint arXiv:2207.10342, 2022

  15. [15]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations, 2020

  16. [16]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  17. [17]

    GLaM: Efficient scaling of language models with mixture-of-experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547–5569. PMLR, 2022

  18. [18]

    Minedojo: Building open-ended embodied agents with internet-scale knowledge

    Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022

  19. [19]

    Instruction-finetuned foundation models for multimodal web navigation

    Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, and Izzeddin Gur. Instruction-finetuned foundation models for multimodal web navigation. In Workshop on Reincarnating Reinforcement Learning at ICLR, 2023

  20. [20]

    The capacity for moral self-correction in large language models

    Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, et al. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459, 2023

  21. [21]

    Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies

    Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021

  22. [22]

    Improving alignment of dialogue agents via targeted human judgements

    Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022

  23. [23]

    Environment generation for zero-shot compositional reinforcement learning

    Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Manoj Tiwari, Honglak Lee, and Aleksandra Faust. Environment generation for zero-shot compositional reinforcement learning. Advances in Neural Information Processing Systems, 34:4157–4169, 2021

  24. [24]

    Understanding html with large language models

    Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding HTML with large language models. arXiv preprint arXiv:2210.03945, 2022

  25. [25]

    Learning to navigate the web

    Izzeddin Gur, Ulrich Rueckert, Aleksandra Faust, and Dilek Hakkani-Tur. Learning to navigate the web. In International Conference on Learning Representations, 2019

  26. [26]

    An empirical analysis of compute-optimal large language model training

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems, 35:30016–30030, 2022

  27. [27]

    Learning to solve arithmetic word problems with verb categorization

    Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In EMNLP, pages 523–533, 2014

  28. [28]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pages 9118–9147. PMLR, 2022

  29. [29]

    Inner monologue: Embodied reasoning through planning with language models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In 6th Annual Conference on Robot Learning, 2022

  30. [30]

    A data-driven approach for learning to control computers

    Peter C Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Adam Santoro, and Timothy Lillicrap. A data-driven approach for learning to control computers. In International Conference on Machine Learning, pages 9466–9482. PMLR, 2022

  31. [31]

    Do BERTs learn to use browser user interface? Exploring multi-step tasks with unified vision-and-language berts

    Taichi Iki and Akiko Aizawa. Do BERTs learn to use browser user interface? Exploring multi-step tasks with unified vision-and-language berts. arXiv preprint arXiv:2203.07828, 2022

  32. [32]

    DOM-Q-NET: Grounded RL on structured language

    Sheng Jia, Jamie Ryan Kiros, and Jimmy Ba. DOM-Q-NET: Grounded RL on structured language. In International Conference on Learning Representations, 2019

  33. [33]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, 2022

  34. [34]

    Parsing algebraic word problems into equations

    Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597, 2015

  35. [35]

    Program induction by rationale generation: Learning to solve and explain algebraic word problems

    Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. Proceedings of ACL, 2017

  36. [36]

    Reinforcement learning on web interfaces using workflow-guided exploration

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations, 2018

  37. [37]

    Mind’s eye: Grounded language model reasoning through simulation

    Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yen Wu, Soroush Vosoughi, Claire Cui, Denny Zhou, and Andrew M Dai. Mind’s eye: Grounded language model reasoning through simulation. In International Conference on Learning Representations, 2023

  38. [38]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023

  39. [39]

    Text and patterns: For effective chain of thought, it takes two to tango

    Aman Madaan and Amir Yazdanbakhsh. Text and patterns: For effective chain of thought, it takes two to tango. arXiv preprint arXiv:2209.07686, 2022

  40. [40]

    Teaching language models to support answers with verified quotes

    Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147, 2022

  41. [41]

    Augmented Language Models: a Survey

    Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023

  42. [42]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  43. [43]

    End-to-end goal-driven web navigation

    Rodrigo Nogueira and Kyunghyun Cho. End-to-end goal-driven web navigation. Advances in Neural Information Processing Systems, 29, 2016

  44. [44]

    Do embodied agents dream of pixelated sheep?: Embodied decision making using language guided world modelling

    Kolby Nottingham, Prithviraj Ammanabrolu, Alane Suhr, Yejin Choi, Hannaneh Hajishirzi, Sameer Singh, and Roy Fox. Do embodied agents dream of pixelated sheep?: Embodied decision making using language guided world modelling. arXiv preprint arXiv:2301.12050, 2023

  45. [45]

    Show your work: Scratchpads for intermediate computation with language models

    Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. In Deep Learning for Code Workshop at ICLR, 2022

  46. [46]

    GPT-4 technical report

    OpenAI. GPT-4 technical report, 2023

  47. [47]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  48. [48]

    ART: Automatic multi-step reasoning and tool-use for large language models

    Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023

  49. [49]

    Mapping natural language commands to web elements

    Panupong Pasupat, Tian-Shun Jiang, Evan Liu, Kelvin Guu, and Percy Liang. Mapping natural language commands to web elements. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4970–4976, 2018

  50. [50]

    Zero-shot entity extraction from web pages

    Panupong Pasupat and Percy Liang. Zero-shot entity extraction from web pages. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 391–401, 2014

  51. [51]

    Are NLP models really able to solve simple math word problems?

    Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? Proceedings of NAACL, 2021

  52. [53]

    Measuring and Narrowing the Compositionality Gap in Language Models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022

  53. [54]

    Planning with large language models via corrective re-prompting

    Shreyas Sundara Raman, Vanya Cohen, Eric Rosen, Ifrah Idrees, David Paulius, and Stefanie Tellex. Planning with large language models via corrective re-prompting. Foundation Models for Decision Making workshop at NeurIPS, 2022

  54. [55]

    A generalist agent

    Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. Transactions on Machine Learning Research, 2022

  55. [56]

    Solving general arithmetic word problems

    Subhro Roy and Dan Roth. Solving general arithmetic word problems. EMNLP, 2016

  56. [57]

    Multitask prompted training enables zero-shot task generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022

  57. [58]

    Self-critiquing models for assisting human evaluators

    William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022

  58. [59]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

  59. [60]

    Memory augmented large language models are computationally universal

    Dale Schuurmans. Memory augmented large language models are computationally universal. arXiv preprint arXiv:2301.04589, 2023

  60. [61]

    World of bits: An open-domain platform for web-based agents

    Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR, 2017

  61. [62]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023

  62. [63]

    CLIPort: What and where pathways for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort: What and where pathways for robotic manipulation. In Conference on Robot Learning, pages 894–906. PMLR, 2022

  63. [64]

    Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

    Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using deepspeed and megatron to train megatron-turing NLG 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022

  64. [65]

    Learning web-based procedures by reasoning over explanations and demonstrations in context

    Shashank Srivastava, Oleksandr Polozov, Nebojsa Jojic, and Christopher Meek. Learning web-based procedures by reasoning over explanations and demonstrations in context. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7652–7662, 2020

  65. [66]

    Recitation-augmented language models

    Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. In International Conference on Learning Representations, 2023

  66. [67]

    Commonsenseqa: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. Proceedings of NAACL-HLT, 2019

  67. [68]

    LaMDA: Language Models for Dialog Applications

    Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022

  68. [69]

    Transformers learn in-context by gradient descent

    Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677, 2022

  69. [70]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023

  70. [71]

    Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

    Zihao Wang, Shaofei Cai, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560, 2023

  71. [72]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022

  72. [73]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022

  73. [74]

    Generating sequences by learning to self-correct

    Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. Generating sequences by learning to self-correct. arXiv preprint arXiv:2211.00053, 2022

  74. [75]

    Chain of thought imitation with procedure cloning

    Mengjiao Sherry Yang, Dale Schuurmans, Pieter Abbeel, and Ofir Nachum. Chain of thought imitation with procedure cloning. Advances in Neural Information Processing Systems, 35:36366–36381, 2022

  75. [76]

    Foundation models for decision making: Problems, methods, and opportunities

    Sherry Yang, Ofir Nachum, Yilun Du, Jason Wei, Pieter Abbeel, and Dale Schuurmans. Foundation models for decision making: Problems, methods, and opportunities. arXiv preprint arXiv:2303.04129, 2023

  76. [77]

    Webshop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik R Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems, 2022

  77. [78]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

  78. [79]

    STaR: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022

  79. [80]

    Socratic models: Composing zero-shot multimodal reasoning with language

    Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language. In International Conference on Learning Representations, 2023

  80. [81]

    TEMPERA: Test-time prompt editing via reinforcement learning

    Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E. Gonzalez. TEMPERA: Test-time prompt editing via reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023
