Recognition: 2 Lean theorem links
PAL: Program-aided Language Models
Pith reviewed 2026-05-15 04:58 UTC · model grok-4.3
The pith
LLMs generate programs as reasoning steps and let a Python interpreter execute them to solve math and symbolic problems more accurately than much larger models using chain-of-thought.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PAL uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter.
What carries the argument
LLM-generated program that encodes the full reasoning trace and is executed by a Python interpreter to produce the final answer.
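The mechanism can be sketched in a few lines. Below is a minimal, hypothetical harness, not the paper's released code: `generated_program` stands in for text the LLM would emit for a GSM8K-style problem, and the Python interpreter, not the model, performs the arithmetic.

```python
# Minimal sketch of the PAL execution step (assumed harness, not the
# paper's code). The LLM would emit `generated_program` as text; here we
# hard-code a plausible completion for a GSM8K-style problem:
# "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?"
generated_program = """
tennis_balls = 5
cans = 2
balls_per_can = 3
answer = tennis_balls + cans * balls_per_can
"""

def run_pal_program(program: str) -> object:
    """Execute the generated reasoning program and return its `answer`."""
    namespace: dict = {}
    exec(program, namespace)  # the interpreter, not the LLM, does the math
    return namespace["answer"]

print(run_pal_program(generated_program))  # -> 11
```

The convention of binding the result to a designated variable (`answer` here) is one simple way to define the interface between model output and interpreter.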
If this is right
- On GSM8K, Codex-powered PAL reaches state-of-the-art few-shot accuracy and exceeds PaLM-540B chain-of-thought by 15 absolute points.
- The same program-plus-interpreter pattern improves accuracy across all thirteen mathematical, symbolic, and algorithmic tasks from BIG-Bench Hard and related benchmarks.
- The LLM no longer needs to perform arithmetic or symbolic execution inside its own generations, reducing a major source of error.
- Smaller models paired with an interpreter can outperform much larger models that attempt both decomposition and solution internally.
Where Pith is reading between the lines
- The approach may extend naturally to any domain where a reliable interpreter exists for the operations the model must perform.
- Combining program generation with other prompting techniques could further reduce remaining decomposition errors.
- The separation of concerns suggests that future models could be trained primarily to emit correct programs rather than to simulate execution.
Load-bearing premise
The language model will produce programs whose logic exactly matches the intended reasoning, without introducing coding or planning mistakes of its own.
What would settle it
A held-out set of word problems on which the generated programs execute cleanly yet return systematically wrong answers because the program logic diverges from the correct decomposition.
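A concrete instance of that failure mode, under an assumed problem wording: the program below is syntactically valid and executes cleanly, yet encodes the wrong decomposition, so clean execution alone cannot certify the reasoning.

```python
# Hypothetical failure case: executes without error, but the decomposition
# is wrong for the same GSM8K-style problem ("Roger has 5 balls and buys
# 2 cans of 3"): the model has treated "buys" as "gives away".
flawed_program = """
tennis_balls = 5
cans = 2
balls_per_can = 3
answer = tennis_balls - cans * balls_per_can   # wrong sign: should be +
"""

namespace: dict = {}
exec(flawed_program, namespace)   # no exception is raised
print(namespace["answer"])        # -> -1, silently wrong (correct: 11)
```

Detecting such cases requires comparing against gold answers or the intended decomposition, which is exactly what the proposed held-out set would measure.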
read the original abstract
Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks, when provided with a few examples at test time ("few-shot prompting"). Much of this success can be attributed to prompting methods such as "chain-of-thought", which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solving each step of the problem. While LLMs seem to be adept at this sort of step-by-step decomposition, LLMs often make logical and arithmetic mistakes in the solution part, even when the problem is decomposed correctly. In this paper, we present Program-Aided Language models (PAL): a novel approach that uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter. We demonstrate this synergy between a neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and algorithmic reasoning tasks from BIG-Bench Hard and other benchmarks. In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models. For example, PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1. Our code and data are publicly available at http://reasonwithpal.com/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Program-Aided Language Models (PAL), where an LLM generates Python programs as intermediate reasoning steps for natural language problems and delegates execution to a runtime interpreter. It evaluates the approach across 13 mathematical, symbolic, and algorithmic reasoning tasks drawn from BIG-Bench Hard and other benchmarks, claiming consistent accuracy gains over strong baselines and a 15-point absolute improvement on GSM8K few-shot accuracy relative to PaLM-540B using chain-of-thought.
Significance. If the results hold, the work is significant because it demonstrates a practical hybrid neural-symbolic method that improves reasoning accuracy without requiring larger model scale. The public release of code and data is a clear strength that supports reproducibility and further research on this paradigm.
major comments (2)
- [Experimental Results] Experimental section: the reported 15% absolute GSM8K gain over PaLM-540B CoT is presented without error bars, multiple random seeds, or statistical significance tests, making it impossible to assess whether the improvement is reliable or could be explained by prompt variance.
- [Method] Method and results: no ablation is provided that isolates the contribution of the interpreter execution from the LLM's program-generation quality (e.g., by comparing PAL against an LLM that generates the same programs but solves them internally), which is load-bearing for the central claim that off-loading computation improves accuracy.
minor comments (2)
- [Abstract] The abstract states results on '13 tasks' but does not enumerate them; adding a short list or reference to the table that defines the suite would improve readability.
- [Figure 1] Figure 1 (or equivalent diagram) would benefit from clearer labeling of the exact interface between the LLM output and the Python interpreter call.
Simulated Author's Rebuttal
We are grateful to the referee for their positive summary and recommendation for major revision. The comments highlight important aspects for improving the clarity and rigor of our experimental results and method. We address each point below and have incorporated revisions accordingly.
read point-by-point responses
-
Referee: [Experimental Results] Experimental section: the reported 15% absolute GSM8K gain over PaLM-540B CoT is presented without error bars, multiple random seeds, or statistical significance tests, making it impossible to assess whether the improvement is reliable or could be explained by prompt variance.
Authors: We agree with the referee that the lack of error bars and statistical tests makes it difficult to fully assess the reliability of the reported improvement. In the revised manuscript, we now include results from multiple runs with different random seeds for the few-shot prompt ordering on GSM8K. We report the mean and standard deviation, and perform a statistical test to confirm the significance of the 15-point gain over PaLM-540B CoT. This revision directly addresses the concern regarding prompt variance. revision: yes
-
Referee: [Method] Method and results: no ablation is provided that isolates the contribution of the interpreter execution from the LLM's program-generation quality (e.g., by comparing PAL against an LLM that generates the same programs but solves them internally), which is load-bearing for the central claim that off-loading computation improves accuracy.
Authors: We appreciate the suggestion to include an ablation isolating the interpreter's contribution. While our original comparisons to chain-of-thought already demonstrate the advantage of using programs over text-based reasoning, we have added the requested ablation in the revision. We compare against a setting where the LLM generates the program and then attempts to solve it by simulating the execution in its own generations. The results show that this internal solving leads to lower accuracy due to arithmetic errors, whereas the interpreter ensures correctness, thereby validating the benefit of off-loading to the runtime. revision: yes
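The requested ablation can be phrased as a small evaluation harness. The sketch below uses assumed interfaces and toy stand-ins (the `problems` list and `simulated_executor` are illustrative placeholders, not the paper's code): the same generated program is scored twice, once via the interpreter and once via a model that predicts the program's output in text.

```python
# Sketch of the ablation described above (assumed interfaces; the data
# and the simulated executor are placeholders, not the paper's code).
from typing import Callable

def interpreter_answer(program: str) -> object:
    """Solve by actually executing the generated program."""
    namespace: dict = {}
    exec(program, namespace)
    return namespace["answer"]

def accuracy(problems, solve: Callable[[str], object]) -> float:
    correct = sum(solve(p["program"]) == p["gold"] for p in problems)
    return correct / len(problems)

# Toy stand-ins: the interpreter is exact; the simulated executor makes
# an arithmetic slip on the second problem, mimicking the failure mode
# the ablation is meant to expose.
problems = [
    {"program": "answer = 5 + 2 * 3", "gold": 11},
    {"program": "answer = 17 * 24", "gold": 408},
]

def simulated_executor(program: str) -> object:
    """Stand-in for an LLM simulating execution in its own generations."""
    faked = {"answer = 5 + 2 * 3": 11, "answer = 17 * 24": 398}  # 398 is a slip
    return faked[program]

print(accuracy(problems, interpreter_answer))   # -> 1.0
print(accuracy(problems, simulated_executor))   # -> 0.5
```

Holding the generated programs fixed across both conditions is what isolates the interpreter's contribution from program-generation quality.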
Circularity Check
No significant circularity; empirical results stand on direct benchmark comparisons
full rationale
The paper introduces PAL as a prompting technique where an LLM generates executable programs for reasoning tasks and delegates execution to an interpreter. No equations, fitted parameters, or self-referential definitions appear in the provided text. Central claims rest on reported few-shot accuracies across 13 tasks (e.g., GSM8K surpassing PaLM-540B CoT by 15%), which are externally falsifiable via public benchmarks and code. No load-bearing self-citations, uniqueness theorems, or ansatzes reduce the method to its inputs by construction. The derivation chain is self-contained through experimental validation rather than mathematical reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models can generate correct and executable programs for the described reasoning tasks when given few-shot examples.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1.
-
IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
Teaching Language Models to Think in Code
ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.
-
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured uplift on a frozen executor, outperforming execution-only training on math and code b...
-
The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate
Homogeneous multi-agent debate introduces sycophantic conformity, contextual fragility, and consensus collapse, leading to equal or lower accuracy than isolated self-correction at 2.1-3.4x higher token cost on GSM-Har...
-
Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software
LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with co...
-
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
-
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
-
LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.
-
Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing
GRIEF fuzzer finds 15 vulnerabilities including 2 CVEs in vLLM and SGLang by testing concurrent workloads for KV-cache isolation failures and cross-request interference.
-
Teaching Language Models to Think in Code
ThinC trains smaller language models to reason entirely in code after minimal NL planning, outperforming tool-integrated baselines and even much larger models on competition math benchmarks.
-
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
TraceLift trains reasoning planners using rewards that credit traces for both rubric quality and actual performance gains on a frozen executor, outperforming final-answer-only training on math and code tasks.
-
Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperformi...
-
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.
-
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
An adaptive compute-optimal strategy for scaling LLM test-time compute achieves over 4x efficiency gains versus best-of-N and lets smaller models outperform 14x larger ones on some problems.
-
Gorilla: Large Language Model Connected with Massive APIs
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
-
Teaching Large Language Models to Self-Debug
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
-
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
-
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
-
Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)
RL post-training lifts answer correctness on FHIR-AgentBench from 50% (o4-mini) to 77% with a cheaper Qwen3-8B CodeAct agent.
-
LLMs with in-context learning for Algorithmic Theoretical Physics
Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.
-
The Cartesian Cut in Agentic AI
LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Self-Refine: Iterative Refinement with Self-Feedback
Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.
-
SciFi: A Safe, Lightweight, User-Friendly, and Fully Autonomous Agentic AI Workflow for Scientific Applications
SciFi is a safe, lightweight agentic AI framework that automates structured scientific tasks with minimal human intervention via isolated environments and layered self-assessing agents.
Reference graph
Works this paper leans on
-
[1]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Ho, D., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jang, E., Ruano, R. J., Jeffrey, K., Jesmonth, S., Joshi, N. J., Julian, R., Kalashnikov, D., Kuang, Y., Lee, K.-H., Levine, S., Lu, Y., Luu, L., Parada, C., Pastor, P., Quia...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Amini, A., Gabriel, S., Lin, S., Koncel-Kedziorski, R., Choi, Y., and Hajishirzi, H. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms https://aclanthology.org/N19-1245. In ACL, 2019
work page 2019
-
[3]
Giving bert a calculator: Finding operations and arguments with reading comprehension
Andor, D., He, L., Lee, K., and Pitler, E. Giving bert a calculator: Finding operations and arguments with reading comprehension. arXiv preprint arXiv:1909.00109, 2019
-
[4]
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert - Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford...
work page 2020
-
[6]
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021b
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Chen, W., Ma, X., Wang, X., and Cohen, W. W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[8]
Cheng, Z., Xie, T., Shi, P., Li, C., Nadkarni, R., Hu, Y., Xiong, C., Radev, D., Ostendorf, M., Zettlemoyer, L., Smith, N. A., and Yu, T. Binding language models in symbolic languages. arXiv preprint arXiv:2210.02875, 2022
-
[9]
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., Levska...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[10]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training Verifiers to Solve Math Word Problems https://arxiv.org/abs/2110.14168. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[11]
Demeter, D. and Downey, D. Just add functions: A neural-symbolic language model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7634--7642, 2020
work page 2020
- [12]
-
[13]
Gehrmann, S., Adewumi, T., Aggarwal, K., Ammanamanchi, P. S., Anuoluwapo, A., Bosselut, A., Chandu, K. R., Clinciu, M., Das, D., Dhole, K. D., Du, W., Durmus, E., Dušek, O., Emezue, C., Gangal, V., Garbacea, C., Hashimoto, T., Hou, Y., Jernite, Y., Jhamtani, H., Ji, Y., Jolly, S., Kale, M., Kumar, D., Ladhak, F., Madaan, A., Maddela, M., Mahajan, K., Maha...
-
[14]
Gellenbeck, E. M. and Cook, C. R. An investigation of procedure and variable names as beacons during program comprehension. In Empirical studies of programmers: Fourth workshop, pp. 65--81. Ablex Publishing, Norwood, NJ, 1991
work page 1991
-
[15]
Neural module networks for reasoning over text
Gupta, N., Lin, K., Roth, D., Singh, S., and Gardner, M. Neural module networks for reasoning over text. arXiv preprint arXiv:1912.04971, 2019
-
[16]
Measuring mathematical problem solving with the MATH dataset, 2021
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset, 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe
work page 2021
-
[17]
The Curious Case of Neural Text Degeneration
Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The Curious Case of Neural Text Degeneration https://arxiv.org/abs/1904.09751. In ICLR, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[18]
Mawps: A math word problem repository
Koncel-Kedziorski, R., Roy, S., Amini, A., Kushman, N., and Hajishirzi, H. Mawps: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1152--1157, 2016
work page 2016
-
[19]
Solving Quantitative Reasoning Problems with Language Models
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., and Misra, V. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[20]
Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems
Ling, W., Yogatama, D., Dyer, C., and Blunsom, P. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems https://arxiv.org/abs/1705.04146. arXiv preprint arXiv:1705.04146, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[21]
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing https://arxiv.org/abs/2107.13586. arXiv preprint arXiv:2107.13586, 2021
-
[22]
Madaan, A. and Yazdanbakhsh, A. Text and patterns: For effective chain of thought, it takes two to tango. arXiv preprint arXiv:2209.07686, 2022
-
[23]
Language models of code are few-shot commonsense learners
Madaan, A., Zhou, S., Alon, U., Yang, Y., and Neubig, G. Language models of code are few-shot commonsense learners. arXiv preprint arXiv:2210.07128, 2022
-
[24]
Deep Learning: A Critical Appraisal
Marcus, G. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[25]
The next decade in ai: four steps towards robust artificial intelligence
Marcus, G. The next decade in ai: four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177, 2020
-
[26]
A diverse corpus for evaluating and developing English math word problem solvers
Miao, S.-y., Liang, C.-C., and Su, K.-Y. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 975--984, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.92. URL https://aclanthology.org/2...
-
[27]
Lila: A unified benchmark for mathematical reasoning
Mishra, S., Finlayson, M., Lu, P., Tang, L., Welleck, S., Baral, C., Rajpurohit, T., Tafjord, O., Sabharwal, A., Clark, P., and Kalyan, A. Lila: A unified benchmark for mathematical reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
work page 2022
-
[28]
Investigating the limitations of transformers with simple arithmetic tasks
Nogueira, R., Jiang, Z., and Lin, J. Investigating the limitations of transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019, 2021
-
[29]
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., Sutton, C., and Odena, A. Show your Work: Scratchpads for Intermediate Computation with Language Models https://arxiv.org/abs/2112.00114. arXiv preprint arXiv:2112.00114, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[30]
Are NLP Models really able to Solve Simple Math Word Problems?
Patel, A., Bhattamishra, S., and Goyal, N. Are NLP Models Really Able to Solve Simple Math Word Problems? https://arxiv.org/abs/2103.07191 arXiv preprint arXiv:2103.07191, 2021
-
[31]
Reasoning like program executors
Pi, X., Liu, Q., Chen, B., Ziyadi, M., Lin, Z., Gao, Y., Fu, Q., Lou, J.-G., and Chen, W. Reasoning like program executors. arXiv preprint arXiv:2201.11473, 2022
-
[32]
Limitations of language models in arithmetic and symbolic induction
Qian, J., Wang, H., Li, Z., Li, S., and Yan, X. Limitations of language models in arithmetic and symbolic induction. arXiv preprint arXiv:2208.05051, 2022
-
[33]
Reif, E., Ippolito, D., Yuan, A., Coenen, A., Callison-Burch, C., and Wei, J. A Recipe for Arbitrary Text Style Transfer with Large Language Models https://arxiv.org/pdf/2109.03910.pdf. arXiv preprint arXiv:2109.03910, 2021
-
[34]
Multitask Prompted Training Enables Zero-Shot Task Generalization
Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., Dey, M., Bari, M. S., Xu, C., Thakker, U., Sharma, S. S., Szczechla, E., Kim, T., Chhablani, G., Nayak, N., Datta, D., Chang, J., Jiang, M. T.-J., Wang, H., Manica, M., Shen, S., Yong, Z. X., Pandey, H., Bawden, R., Wang, T., Neeraj...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[35]
Shin, R. and Van Durme, B. Few-shot semantic parsing with language models trained on code. arXiv preprint arXiv:2112.08696, 2021
-
[36]
Shin, R., Lin, C. H., Thomson, S., Chen, C., Roy, S., Platanios, E. A., Pauls, A., Klein, D., Eisner, J., and Van Durme, B. Constrained language models yield few-shot semantic parsers. arXiv preprint arXiv:2104.08768, 2021
-
[37]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E., Zhou, D., and Wei, J. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[38]
Takang, A. A., Grubb, P. A., and Macredie, R. D. The effects of comments and identifier names on program comprehensibility: an experimental investigation. J. Prog. Lang., 4(3):143--167, 1996
work page 1996
-
[39]
Rationale-Augmented Ensembles in Language Models
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., and Zhou, D. Rationale-Augmented Ensembles in Language Models https://arxiv.org/abs/2207.00747. arXiv preprint arXiv:2207.00747, 2022a
-
[40]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., and Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models https://arxiv.org/abs/2203.11171. arXiv preprint arXiv:2203.11171, 2022b
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[41]
Finetuned Language Models Are Zero-Shot Learners
Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned Language Models are Zero-shot Learners https://arxiv.org/pdf/2109.01652.pdf. arXiv preprint arXiv:2109.01652, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[42]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. Chain of Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/abs/2201.11903. arXiv preprint arXiv:2201.11903, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
Wu, Y., Jiang, A. Q., Li, W., Rabe, M. N., Staats, C., Jamnik, M., and Szegedy, C. Autoformalization with Large Language Models https://arxiv.org/abs/2205.12615. arXiv preprint arXiv:2205.12615, 2022
-
[44]
ReAct: Synergizing Reasoning and Acting in Language Models
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[45]
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Bousquet, O., Le, Q., and Chi, E. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models https://arxiv.org/abs/2205.10625. arXiv preprint arXiv:2205.10625, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
discussion (0)