Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
Pith reviewed 2026-05-12 16:44 UTC · model grok-4.3
The pith
Language models solve numerical reasoning tasks more accurately when they generate programs for external execution rather than performing calculations in text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that expressing the reasoning process as a program in a language like Python lets the model offload exact computation to an interpreter, yielding fewer errors on numerical tasks than performing everything in text-based chain-of-thought reasoning.
What carries the argument
Program of Thoughts (PoT) prompting, a strategy that has the language model output a program representing the solution steps for execution by an external interpreter.
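A minimal sketch of the mechanism (the problem text and the "generated" program below are hypothetical stand-ins for an actual Codex generation, not the paper's prompts): the model emits a short program whose variables mirror the reasoning steps, and a Python interpreter, not the model, performs the arithmetic.

```python
problem = ("Olivia has $23. She bought five bagels for $3 each. "
           "How much money does she have left?")

# A program the model might emit: each line mirrors one reasoning step.
# Hard-coded here as a stand-in for an LLM call.
generated_program = """
money_initial = 23
bagels = 5
bagel_cost = 3
money_spent = bagels * bagel_cost
ans = money_initial - money_spent
"""

# The interpreter executes the program; the computation is exact.
namespace = {}
exec(generated_program, namespace)
print(namespace["ans"])  # 8
```

The convention of binding the final result to a variable such as `ans` is what lets a harness read the answer back out after execution.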
If this is right
- Higher accuracy on five math word problem datasets: GSM, AQuA, SVAMP, TabMWP, and MultiArith.
- Improved results on three financial QA datasets: FinQA, ConvFinQA, and TATQA.
- Effective in both few-shot and zero-shot prompting setups.
- When combined with self-consistency, achieves state-of-the-art on math problems and near state-of-the-art on financial problems.
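The self-consistency combination above can be sketched as: sample several candidate programs for the same problem, execute each, and take a majority vote over the executed answers. The sampled programs below are invented for illustration; this is a sketch of the decoding scheme, not the paper's implementation.

```python
from collections import Counter

def execute(program: str):
    """Run one generated program in isolation; return its `ans`, or None on failure."""
    ns = {}
    try:
        exec(program, ns)
        return ns.get("ans")
    except Exception:
        return None

# Hypothetical sampled programs (temperature > 0 would yield such variants);
# one contains a reasoning slip.
samples = [
    "ans = (23 - 5 * 3)",
    "spent = 5 * 3\nans = 23 - spent",
    "ans = 23 - 5 - 3",               # faulty reasoning path
    "cost = 3 * 5\nans = 23 - cost",
]

answers = [a for a in (execute(p) for p in samples) if a is not None]
majority, votes = Counter(answers).most_common(1)[0]
print(majority, votes)  # 8 3
```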
Where Pith is reading between the lines
- Language models may be stronger at generating structured plans than at executing precise calculations internally.
- The method could extend to other domains requiring accurate computation, such as scientific simulations or data analysis.
- Using more capable code generation models might further increase the performance gap over text-based methods.
Load-bearing premise
The language model can produce programs that accurately reflect the intended reasoning steps without introducing logical or syntactic mistakes.
What would settle it
If executing the programs generated by the model on the test datasets produces no higher accuracy than chain-of-thought prompting, or if the programs frequently contain errors that lead to wrong answers, the proposed advantage would not hold.
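The falsification test described above amounts to scoring both methods against the same gold labels: PoT answers come from executing programs, CoT answers from parsing the model's text. A toy harness, with hypothetical generations standing in for model outputs:

```python
import re

gold = [8, 15, 42]
pot_programs = ["ans = 23 - 5 * 3", "ans = 3 * 5", "ans = 6 * 7"]
cot_texts = [
    "... so she has 23 - 15 = 8 dollars left. The answer is 8.",
    "... 3 * 5 = 16. The answer is 16.",   # internal arithmetic slip
    "... 6 * 7 = 42. The answer is 42.",
]

def pot_answer(program):
    ns = {}
    exec(program, ns)        # interpreter does the arithmetic
    return ns["ans"]

def cot_answer(text):
    m = re.search(r"The answer is (-?\d+)", text)
    return int(m.group(1)) if m else None

pot_acc = sum(pot_answer(p) == g for p, g in zip(pot_programs, gold)) / len(gold)
cot_acc = sum(cot_answer(t) == g for t, g in zip(cot_texts, gold)) / len(gold)
print(pot_acc, cot_acc)
```

If `pot_acc` failed to exceed `cot_acc` on the real benchmarks, the proposed advantage would not hold.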
read the original abstract
Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thoughts prompting (CoT) is by far the state-of-the-art method for these tasks. CoT uses language models to perform both reasoning and computation in the multi-step 'thought' process. To disentangle computation from reasoning, we propose 'Program of Thoughts' (PoT), which uses language models (mainly Codex) to express the reasoning process as a program. The computation is relegated to an external computer, which executes the generated programs to derive the answer. We evaluate PoT on five math word problem datasets (GSM, AQuA, SVAMP, TabMWP, MultiArith) and three financial-QA datasets (FinQA, ConvFinQA, TATQA) for both few-shot and zero-shot setups. Under both few-shot and zero-shot settings, PoT can show an average performance gain over CoT by around 12% across all the evaluated datasets. By combining PoT with self-consistency decoding, we can achieve SoTA performance on all math problem datasets and near-SoTA performance on financial datasets. All of our data and code are released on GitHub: https://github.com/wenhuchen/Program-of-Thoughts
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Program of Thoughts (PoT) prompting, in which a language model (primarily Codex) generates Python programs that encode the step-by-step reasoning for numerical word problems and financial QA tasks; an external interpreter then executes these programs to produce the final answer. This is contrasted with Chain-of-Thought (CoT) prompting, where the model performs both reasoning and arithmetic internally. The authors evaluate PoT against CoT (and other baselines) on five math datasets (GSM, AQuA, SVAMP, TabMWP, MultiArith) and three financial datasets (FinQA, ConvFinQA, TATQA) under both few-shot and zero-shot regimes, reporting an average ~12% absolute gain over CoT and state-of-the-art results when PoT is combined with self-consistency decoding. Code and data are released.
Significance. If the central empirical claims hold after verification, the work provides a practical and reproducible method for improving numerical reasoning accuracy by off-loading computation to an interpreter, which sidesteps arithmetic errors common in pure language-model generation. The public release of code and data is a clear strength that supports reproducibility and follow-up work.
major comments (2)
- [Experiments] Experiments section: the headline performance gains and SOTA claims rest on the premise that generated programs are syntactically valid and semantically faithful to the required reasoning steps. No breakdown is provided of program validity rates, types of generation errors (e.g., incorrect variable bindings, omitted steps, or off-by-one logic), or semantic fidelity to gold reasoning paths; end-task accuracy alone does not distinguish whether gains arise from disentangling computation or simply from avoiding arithmetic slips that CoT must perform internally.
- [Tables 1-3] Tables 1–3 (few-shot and zero-shot results): the reported average 12% gain and per-dataset numbers lack error bars, standard deviations across multiple sampling runs, or statistical significance tests. Given the known stochasticity of Codex outputs, it is unclear whether the observed margins are robust or sensitive to prompt phrasing and decoding temperature.
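The breakdown requested in the first major comment could be computed by bucketing each generated program into syntax error, runtime error, wrong answer, or correct. A sketch with hypothetical generations (the paper does not report such a classifier; this is one plausible instrumentation):

```python
from collections import Counter

def classify(program: str, gold):
    """Bucket a generated program by failure mode against the gold answer."""
    try:
        compile(program, "<generated>", "exec")
    except SyntaxError:
        return "syntax_error"
    ns = {}
    try:
        exec(program, ns)
    except Exception:
        return "runtime_error"
    return "correct" if ns.get("ans") == gold else "wrong_answer"

# Hypothetical generations for a problem whose gold answer is 8.
cases = [
    ("ans = 23 - 5 * 3", 8),
    ("ans = 23 - 5 * ", 8),          # truncated generation
    ("ans = 23 - undefined_var", 8), # bad variable binding
    ("ans = 23 - 5 - 3", 8),         # valid program, unfaithful reasoning
]
report = Counter(classify(p, g) for p, g in cases)
print(dict(report))
```

Only the `wrong_answer` bucket speaks to semantic fidelity; the first two buckets measure raw program validity, which is the distinction the comment asks the authors to draw.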
minor comments (2)
- [Abstract] The abstract states 'around 12%' without specifying whether this is a simple arithmetic mean or a weighted average across datasets of different sizes.
- [§3] Notation for the program-generation prompt template could be made more explicit (e.g., by showing the exact few-shot exemplars used for each dataset) to aid exact reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our work. We address each major comment below and outline the revisions we will make to strengthen the empirical support for our claims.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the headline performance gains and SOTA claims rest on the premise that generated programs are syntactically valid and semantically faithful to the required reasoning steps. No breakdown is provided of program validity rates, types of generation errors (e.g., incorrect variable bindings, omitted steps, or off-by-one logic), or semantic fidelity to gold reasoning paths; end-task accuracy alone does not distinguish whether gains arise from disentangling computation or simply from avoiding arithmetic slips that CoT must perform internally.
Authors: We agree that a finer-grained analysis of program quality would strengthen the interpretation of our results. While the core advantage of PoT is that an external interpreter guarantees correct execution of whatever program is generated (eliminating the arithmetic errors that plague CoT), we acknowledge that end-task accuracy alone leaves open the question of how often the generated programs are faithful to the intended reasoning. In the revised manuscript we will add a new subsection under Experiments that reports (1) syntactic validity rates of generated programs on each dataset, (2) a categorization of the most frequent semantic errors (e.g., incorrect variable assignment, missing intermediate steps, off-by-one logic), and (3) a manual comparison of reasoning paths for a random sample of 100 problems against the gold solutions. This analysis will help readers assess whether the observed gains derive primarily from reliable computation or from improved reasoning structure as well. revision: yes
-
Referee: [Tables 1-3] Tables 1–3 (few-shot and zero-shot results): the reported average 12% gain and per-dataset numbers lack error bars, standard deviations across multiple sampling runs, or statistical significance tests. Given the known stochasticity of Codex outputs, it is unclear whether the observed margins are robust or sensitive to prompt phrasing and decoding temperature.
Authors: We recognize that Codex sampling is stochastic and that single-run numbers can be sensitive to temperature and prompt wording. In the revised version we will re-run the main few-shot and zero-shot experiments with three independent sampling seeds (temperature 0.7, as used in the original submission) and report mean accuracy together with standard deviation in Tables 1–3. We will also add a footnote or appendix table showing the results of paired t-tests between PoT and CoT on each dataset to establish statistical significance of the reported margins. These additions will directly address concerns about robustness. revision: yes
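The promised robustness additions could be computed along these lines. The per-seed accuracies below are invented for illustration; `scipy.stats.ttest_rel` would also return a p-value, but the t statistic is computed here with only the standard library.

```python
import math
import statistics as st

# Hypothetical accuracies from three sampling seeds, per the rebuttal's plan.
pot_runs = [71.6, 72.1, 70.9]
cot_runs = [59.8, 60.4, 59.1]

pot_mean, pot_sd = st.mean(pot_runs), st.stdev(pot_runs)
cot_mean, cot_sd = st.mean(cot_runs), st.stdev(cot_runs)
print(f"PoT {pot_mean:.1f} ± {pot_sd:.1f}   CoT {cot_mean:.1f} ± {cot_sd:.1f}")

# Paired t statistic over per-seed differences.
diffs = [p - c for p, c in zip(pot_runs, cot_runs)]
t = st.mean(diffs) / (st.stdev(diffs) / math.sqrt(len(diffs)))
print(f"paired t = {t:.2f} on {len(diffs) - 1} df")
```

Pairing by seed is what makes the test sensitive: both methods share each seed's sampling noise, so the differences have far smaller variance than the raw accuracies.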
Circularity Check
No circularity: empirical prompting comparison with independent evaluation
full rationale
The paper proposes PoT as an alternative prompting strategy to CoT, where the LM generates executable programs and an external interpreter performs the arithmetic. All reported gains (average ~12% over CoT, SoTA with self-consistency) are obtained from direct accuracy measurements on fixed public benchmarks (GSM, AQuA, SVAMP, TabMWP, MultiArith, FinQA, ConvFinQA, TATQA) under few-shot and zero-shot protocols. No equations, fitted parameters, or uniqueness theorems are invoked; the method is not derived from prior self-citations but is tested against external baselines. The released code and data make the results independently reproducible and falsifiable, satisfying the criteria for a self-contained empirical claim.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Language models such as Codex can generate syntactically valid and semantically correct programs that capture the reasoning process for the evaluated numerical tasks.
Forward citations
Cited by 35 Pith papers
-
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
-
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
-
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
-
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems
VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
-
Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software
LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with co...
-
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
-
PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.
-
Weighted Rules under the Stable Model Semantics
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
-
PaT: Planning-after-Trial for Efficient Test-Time Code Generation
PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
-
A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.
-
KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models
KnowledgeBerg benchmark shows open-source LLMs achieve only 5.26-36.88 F1 on universe enumeration and 16-44% accuracy on knowledge-grounded compositional reasoning, with persistent failures in completeness, awareness,...
-
DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates
DenTab provides 2,000 annotated dental table images and 2,208 questions to benchmark 16 systems on table structure recognition and VQA, revealing that strong layout recovery does not ensure reliable multi-step arithme...
-
SinkTrack: Attention Sink based Context Anchoring for Large Language Models
SinkTrack uses attention sink at the BOS token to anchor LLMs to initial context, reducing hallucination and forgetting with reported gains on benchmarks like SQuAD2.0 and M3CoT.
-
ExecTune: Effective Steering of Black-Box LLMs with Guide Models
ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on...
-
When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.
-
ToolRL: Reward is All Tool Learning Needs
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
-
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.
-
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
-
Teaching Large Language Models to Self-Debug
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
-
Multimodal Chain-of-Thought Reasoning in Language Models
Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
-
EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents
EGL-SCA co-evolves instructions and tools via structural credit assignment in graph reasoning agents and reports 92% average success on four benchmarks.
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
-
LLMs with in-context learning for Algorithmic Theoretical Physics
Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.
-
Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning
TaNOS decouples table semantics from numerical structure via anonymization, sketches, and program-first self-supervision, yielding 80.13% FinQA accuracy with 10% data and near-zero cross-domain gap versus over 10pp fo...
-
The Cartesian Cut in Agentic AI
LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
-
Understanding the planning of LLM agents: A survey
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
-
Position: How can Graphs Help Large Language Models?
Graphs can help LLMs reduce hallucinations, boost reasoning via prompting techniques, and better process structured data.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
-
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...
discussion (0)