Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
Pith reviewed 2026-05-12 16:44 UTC · model grok-4.3
The pith
Language models solve numerical reasoning tasks more accurately when they generate programs for external execution rather than performing calculations in text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that expressing the reasoning process as a program in a language like Python lets the model offload exact computation to an interpreter, yielding fewer errors on numerical tasks than performing everything in text-based chain-of-thought reasoning.
What carries the argument
Program of Thoughts (PoT) prompting, a strategy that has the language model output a program representing the solution steps for execution by an external interpreter.
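A minimal sketch of the mechanism (the problem text and the "generated" program below are hypothetical stand-ins for an actual Codex generation, not the paper's prompts): the model emits a short program whose variables mirror the reasoning steps, and a Python interpreter, not the model, performs the arithmetic.

```python
problem = ("Olivia has $23. She bought five bagels for $3 each. "
           "How much money does she have left?")

# A program the model might emit: each line mirrors one reasoning step.
# Hard-coded here as a stand-in for an LLM call.
generated_program = """
money_initial = 23
bagels = 5
bagel_cost = 3
money_spent = bagels * bagel_cost
ans = money_initial - money_spent
"""

# The interpreter executes the program; the computation is exact.
namespace = {}
exec(generated_program, namespace)
print(namespace["ans"])  # 8
```

The convention of binding the final result to a variable such as `ans` is what lets a harness read the answer back out after execution.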
If this is right
- Higher accuracy on five math word problem datasets: GSM, AQuA, SVAMP, TabMWP, and MultiArith.
- Improved results on three financial QA datasets: FinQA, ConvFinQA, and TATQA.
- Effective in both few-shot and zero-shot prompting setups.
- When combined with self-consistency, achieves state-of-the-art on math problems and near state-of-the-art on financial problems.
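The self-consistency combination above can be sketched as: sample several candidate programs for the same problem, execute each, and take a majority vote over the executed answers. The sampled programs below are invented for illustration; this is a sketch of the decoding scheme, not the paper's implementation.

```python
from collections import Counter

def execute(program: str):
    """Run one generated program in isolation; return its `ans`, or None on failure."""
    ns = {}
    try:
        exec(program, ns)
        return ns.get("ans")
    except Exception:
        return None

# Hypothetical sampled programs (temperature > 0 would yield such variants);
# one contains a reasoning slip.
samples = [
    "ans = (23 - 5 * 3)",
    "spent = 5 * 3\nans = 23 - spent",
    "ans = 23 - 5 - 3",               # faulty reasoning path
    "cost = 3 * 5\nans = 23 - cost",
]

answers = [a for a in (execute(p) for p in samples) if a is not None]
majority, votes = Counter(answers).most_common(1)[0]
print(majority, votes)  # 8 3
```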
Where Pith is reading between the lines
- Language models may be stronger at generating structured plans than at executing precise calculations internally.
- The method could extend to other domains requiring accurate computation, such as scientific simulations or data analysis.
- Using more capable code generation models might further increase the performance gap over text-based methods.
Load-bearing premise
The language model can produce programs that accurately reflect the intended reasoning steps without introducing logical or syntactic mistakes.
What would settle it
If executing the programs generated by the model on the test datasets produces no higher accuracy than chain-of-thought prompting, or if the programs frequently contain errors that lead to wrong answers, the proposed advantage would not hold.
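The falsification test described above amounts to scoring both methods against the same gold labels: PoT answers come from executing programs, CoT answers from parsing the model's text. A toy harness, with hypothetical generations standing in for model outputs:

```python
import re

gold = [8, 15, 42]
pot_programs = ["ans = 23 - 5 * 3", "ans = 3 * 5", "ans = 6 * 7"]
cot_texts = [
    "... so she has 23 - 15 = 8 dollars left. The answer is 8.",
    "... 3 * 5 = 16. The answer is 16.",   # internal arithmetic slip
    "... 6 * 7 = 42. The answer is 42.",
]

def pot_answer(program):
    ns = {}
    exec(program, ns)        # interpreter does the arithmetic
    return ns["ans"]

def cot_answer(text):
    m = re.search(r"The answer is (-?\d+)", text)
    return int(m.group(1)) if m else None

pot_acc = sum(pot_answer(p) == g for p, g in zip(pot_programs, gold)) / len(gold)
cot_acc = sum(cot_answer(t) == g for t, g in zip(cot_texts, gold)) / len(gold)
print(pot_acc, cot_acc)
```

If `pot_acc` failed to exceed `cot_acc` on the real benchmarks, the proposed advantage would not hold.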
read the original abstract
Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thoughts prompting (CoT) is by far the state-of-the-art method for these tasks. CoT uses language models to perform both reasoning and computation in the multi-step 'thought' process. To disentangle computation from reasoning, we propose 'Program of Thoughts' (PoT), which uses language models (mainly Codex) to express the reasoning process as a program. The computation is relegated to an external computer, which executes the generated programs to derive the answer. We evaluate PoT on five math word problem datasets (GSM, AQuA, SVAMP, TabMWP, MultiArith) and three financial-QA datasets (FinQA, ConvFinQA, TATQA) for both few-shot and zero-shot setups. Under both few-shot and zero-shot settings, PoT can show an average performance gain over CoT by around 12% across all the evaluated datasets. By combining PoT with self-consistency decoding, we can achieve SoTA performance on all math problem datasets and near-SoTA performance on financial datasets. All of our data and code are released on GitHub: https://github.com/wenhuchen/Program-of-Thoughts
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Program of Thoughts (PoT) prompting, in which a language model (primarily Codex) generates Python programs that encode the step-by-step reasoning for numerical word problems and financial QA tasks; an external interpreter then executes these programs to produce the final answer. This is contrasted with Chain-of-Thought (CoT) prompting, where the model performs both reasoning and arithmetic internally. The authors evaluate PoT against CoT (and other baselines) on five math datasets (GSM, AQuA, SVAMP, TabMWP, MultiArith) and three financial datasets (FinQA, ConvFinQA, TATQA) under both few-shot and zero-shot regimes, reporting an average ~12% absolute gain over CoT and state-of-the-art results when PoT is combined with self-consistency decoding. Code and data are released.
Significance. If the central empirical claims hold after verification, the work provides a practical and reproducible method for improving numerical reasoning accuracy by off-loading computation to an interpreter, which sidesteps arithmetic errors common in pure language-model generation. The public release of code and data is a clear strength that supports reproducibility and follow-up work.
major comments (2)
- [Experiments] Experiments section: the headline performance gains and SOTA claims rest on the premise that generated programs are syntactically valid and semantically faithful to the required reasoning steps. No breakdown is provided of program validity rates, types of generation errors (e.g., incorrect variable bindings, omitted steps, or off-by-one logic), or semantic fidelity to gold reasoning paths; end-task accuracy alone does not distinguish whether gains arise from disentangling computation or simply from avoiding arithmetic slips that CoT must perform internally.
- [Tables 1-3] Tables 1–3 (few-shot and zero-shot results): the reported average 12% gain and per-dataset numbers lack error bars, standard deviations across multiple sampling runs, or statistical significance tests. Given the known stochasticity of Codex outputs, it is unclear whether the observed margins are robust or sensitive to prompt phrasing and decoding temperature.
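The breakdown requested in the first major comment could be computed by bucketing each generated program into syntax error, runtime error, wrong answer, or correct. A sketch with hypothetical generations (the paper does not report such a classifier; this is one plausible instrumentation):

```python
from collections import Counter

def classify(program: str, gold):
    """Bucket a generated program by failure mode against the gold answer."""
    try:
        compile(program, "<generated>", "exec")
    except SyntaxError:
        return "syntax_error"
    ns = {}
    try:
        exec(program, ns)
    except Exception:
        return "runtime_error"
    return "correct" if ns.get("ans") == gold else "wrong_answer"

# Hypothetical generations for a problem whose gold answer is 8.
cases = [
    ("ans = 23 - 5 * 3", 8),
    ("ans = 23 - 5 * ", 8),          # truncated generation
    ("ans = 23 - undefined_var", 8), # bad variable binding
    ("ans = 23 - 5 - 3", 8),         # valid program, unfaithful reasoning
]
report = Counter(classify(p, g) for p, g in cases)
print(dict(report))
```

Only the `wrong_answer` bucket speaks to semantic fidelity; the first two buckets measure raw program validity, which is the distinction the comment asks the authors to draw.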
minor comments (2)
- [Abstract] The abstract states 'around 12%' without specifying whether this is a simple arithmetic mean or a weighted average across datasets of different sizes.
- [§3] Notation for the program-generation prompt template could be made more explicit (e.g., by showing the exact few-shot exemplars used for each dataset) to aid exact reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our work. We address each major comment below and outline the revisions we will make to strengthen the empirical support for our claims.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the headline performance gains and SOTA claims rest on the premise that generated programs are syntactically valid and semantically faithful to the required reasoning steps. No breakdown is provided of program validity rates, types of generation errors (e.g., incorrect variable bindings, omitted steps, or off-by-one logic), or semantic fidelity to gold reasoning paths; end-task accuracy alone does not distinguish whether gains arise from disentangling computation or simply from avoiding arithmetic slips that CoT must perform internally.
Authors: We agree that a finer-grained analysis of program quality would strengthen the interpretation of our results. While the core advantage of PoT is that an external interpreter guarantees correct execution of whatever program is generated (eliminating the arithmetic errors that plague CoT), we acknowledge that end-task accuracy alone leaves open the question of how often the generated programs are faithful to the intended reasoning. In the revised manuscript we will add a new subsection under Experiments that reports (1) syntactic validity rates of generated programs on each dataset, (2) a categorization of the most frequent semantic errors (e.g., incorrect variable assignment, missing intermediate steps, off-by-one logic), and (3) a manual comparison of reasoning paths for a random sample of 100 problems against the gold solutions. This analysis will help readers assess whether the observed gains derive primarily from reliable computation or from improved reasoning structure as well. revision: yes
-
Referee: [Tables 1-3] Tables 1–3 (few-shot and zero-shot results): the reported average 12% gain and per-dataset numbers lack error bars, standard deviations across multiple sampling runs, or statistical significance tests. Given the known stochasticity of Codex outputs, it is unclear whether the observed margins are robust or sensitive to prompt phrasing and decoding temperature.
Authors: We recognize that Codex sampling is stochastic and that single-run numbers can be sensitive to temperature and prompt wording. In the revised version we will re-run the main few-shot and zero-shot experiments with three independent sampling seeds (temperature 0.7, as used in the original submission) and report mean accuracy together with standard deviation in Tables 1–3. We will also add a footnote or appendix table showing the results of paired t-tests between PoT and CoT on each dataset to establish statistical significance of the reported margins. These additions will directly address concerns about robustness. revision: yes
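The promised robustness additions could be computed along these lines. The per-seed accuracies below are invented for illustration; `scipy.stats.ttest_rel` would also return a p-value, but the t statistic is computed here with only the standard library.

```python
import math
import statistics as st

# Hypothetical accuracies from three sampling seeds, per the rebuttal's plan.
pot_runs = [71.6, 72.1, 70.9]
cot_runs = [59.8, 60.4, 59.1]

pot_mean, pot_sd = st.mean(pot_runs), st.stdev(pot_runs)
cot_mean, cot_sd = st.mean(cot_runs), st.stdev(cot_runs)
print(f"PoT {pot_mean:.1f} ± {pot_sd:.1f}   CoT {cot_mean:.1f} ± {cot_sd:.1f}")

# Paired t statistic over per-seed differences.
diffs = [p - c for p, c in zip(pot_runs, cot_runs)]
t = st.mean(diffs) / (st.stdev(diffs) / math.sqrt(len(diffs)))
print(f"paired t = {t:.2f} on {len(diffs) - 1} df")
```

Pairing by seed is what makes the test sensitive: both methods share each seed's sampling noise, so the differences have far smaller variance than the raw accuracies.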
Circularity Check
No circularity: empirical prompting comparison with independent evaluation
full rationale
The paper proposes PoT as an alternative prompting strategy to CoT, where the LM generates executable programs and an external interpreter performs the arithmetic. All reported gains (average ~12% over CoT, SoTA with self-consistency) are obtained from direct accuracy measurements on fixed public benchmarks (GSM, AQuA, SVAMP, TabMWP, MultiArith, FinQA, ConvFinQA, TATQA) under few-shot and zero-shot protocols. No equations, fitted parameters, or uniqueness theorems are invoked; the method is not derived from prior self-citations but is tested against external baselines. The released code and data make the results independently reproducible and falsifiable, satisfying the criteria for a self-contained empirical claim.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Language models such as Codex can generate syntactically valid and semantically correct programs that capture the reasoning process for the evaluated numerical tasks.
Forward citations
Cited by 35 Pith papers
-
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
-
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
-
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
-
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems
VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
-
Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software
LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with co...
-
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
-
PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.
-
Weighted Rules under the Stable Model Semantics
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
-
PaT: Planning-after-Trial for Efficient Test-Time Code Generation
PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
-
A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.
-
KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models
KnowledgeBerg benchmark shows open-source LLMs achieve only 5.26-36.88 F1 on universe enumeration and 16-44% accuracy on knowledge-grounded compositional reasoning, with persistent failures in completeness, awareness,...
-
DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates
DenTab provides 2,000 annotated dental table images and 2,208 questions to benchmark 16 systems on table structure recognition and VQA, revealing that strong layout recovery does not ensure reliable multi-step arithme...
-
SinkTrack: Attention Sink based Context Anchoring for Large Language Models
SinkTrack uses attention sink at the BOS token to anchor LLMs to initial context, reducing hallucination and forgetting with reported gains on benchmarks like SQuAD2.0 and M3CoT.
-
ExecTune: Effective Steering of Black-Box LLMs with Guide Models
ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on...
-
When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.
-
ToolRL: Reward is All Tool Learning Needs
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
-
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.
-
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
-
Teaching Large Language Models to Self-Debug
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
-
Multimodal Chain-of-Thought Reasoning in Language Models
Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
-
EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents
EGL-SCA co-evolves instructions and tools via structural credit assignment in graph reasoning agents and reports 92% average success on four benchmarks.
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
-
LLMs with in-context learning for Algorithmic Theoretical Physics
Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.
-
Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning
TaNOS decouples table semantics from numerical structure via anonymization, sketches, and program-first self-supervision, yielding 80.13% FinQA accuracy with 10% data and near-zero cross-domain gap versus over 10pp fo...
-
The Cartesian Cut in Agentic AI
LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
-
Understanding the planning of LLM agents: A survey
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
-
Position: How can Graphs Help Large Language Models?
Graphs can help LLMs reduce hallucinations, boost reasoning via prompting techniques, and better process structured data.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
-
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...
discussion (0)