Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Pith reviewed 2026-05-10 12:49 UTC · model grok-4.3
The pith
Chain of thought prompting lets large language models reach state-of-the-art accuracy on math word problems using only eight examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generating a chain of thought, a series of intermediate reasoning steps, significantly improves the ability of large language models to perform complex reasoning. Such reasoning abilities emerge naturally in sufficiently large language models via chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in the prompt.
What carries the argument
Chain-of-thought prompting: augmenting the few-shot prompt with a small number of exemplars, each of which shows a sequence of explicit reasoning steps before its final answer.
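To make the prompt format concrete, here is a minimal sketch of how such a prompt could be assembled. The first exemplar follows the tennis-ball demonstration from the paper's Figure 1; the helper name `build_cot_prompt` and the exemplar data structure are illustrative, not the authors' code, and a full prompt would use all eight exemplars.

```python
# Minimal sketch of chain-of-thought prompt construction (illustrative; not the
# authors' implementation). Each exemplar shows reasoning steps before the answer.
COT_EXEMPLARS = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
                    "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "chain_of_thought": "Roger started with 5 balls. 2 cans of 3 tennis balls each "
                            "is 6 tennis balls. 5 + 6 = 11.",
        "answer": "11",
    },
    # ... the remaining exemplars follow the same pattern ...
]

def build_cot_prompt(exemplars, new_question):
    """Concatenate exemplars (question, reasoning chain, final answer), then
    append the unsolved question for the model to continue."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['chain_of_thought']} The answer is {ex['answer']}.\n"
        )
    parts.append(f"Q: {new_question}\nA:")
    return "\n".join(parts)

prompt = build_cot_prompt(COT_EXEMPLARS, "A new math word problem goes here.")
```

A standard few-shot prompt differs only in omitting the `chain_of_thought` text before each answer.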
If this is right
- A 540B model with eight chain-of-thought exemplars reaches state-of-the-art accuracy on GSM8K, beating fine-tuned GPT-3 with a verifier.
- The same prompting method improves results across arithmetic, commonsense, and symbolic reasoning benchmarks.
- Reasoning performance scales with model size once chain-of-thought exemplars are supplied.
- Complex tasks become solvable without retraining or additional fine-tuning data.
Where Pith is reading between the lines
- The method could extend to problems that require much longer reasoning chains if the prompt examples are lengthened accordingly.
- It opens a route to more interpretable model outputs by making the generated steps visible to users.
- Standard few-shot prompting may systematically underestimate what current models can do on reasoning benchmarks.
- Varying the structure of the reasoning steps in the examples would test whether models follow the logic or merely copy surface patterns.
Load-bearing premise
The performance gains are caused by the explicit reasoning steps rather than by simply supplying longer or more detailed prompts in general.
What would settle it
A controlled test in which prompts of matched length contain no reasoning steps but still produce the same accuracy gains on GSM8K would falsify the claim.
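As a rough illustration of what such a control could look like, the sketch below (reusing the exemplar format from the earlier snippet) replaces each reasoning chain with filler of matched character length while keeping the final answer. This is an assumed experimental design, not an experiment reported in the paper.

```python
# Hypothetical length-matched control: filler of the same character count
# replaces each reasoning chain, so any remaining gain cannot be attributed
# to explicit reasoning steps. (Illustrative design, not from the paper.)
def build_length_matched_prompt(exemplars, new_question):
    parts = []
    for ex in exemplars:
        filler = "." * len(ex["chain_of_thought"])  # matched length, no reasoning content
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {filler} The answer is {ex['answer']}.\n"
        )
    parts.append(f"Q: {new_question}\nA:")
    return "\n".join(parts)
```

If prompts built this way matched chain-of-thought accuracy on GSM8K, the load-bearing premise above would fail.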
read the original abstract
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces chain-of-thought (CoT) prompting, in which few-shot exemplars are augmented with explicit sequences of intermediate reasoning steps. Experiments across three large language models (including PaLM-540B) demonstrate that this format yields substantial accuracy gains on arithmetic, commonsense, and symbolic reasoning benchmarks relative to standard few-shot prompting. The headline result is that eight fixed CoT exemplars enable the 540B model to reach state-of-the-art accuracy on GSM8K, surpassing a fine-tuned GPT-3 model equipped with a verifier. Gains are shown to emerge only at large scale, with supporting ablations and multiple runs on most tasks.
Significance. If the empirical results hold, the work is significant because it shows that complex reasoning capabilities can be elicited from LLMs via a simple, training-free prompting change. The consistent improvements across task families, the clear scaling threshold, the use of held-out test sets, and the reporting of multiple runs and error bars provide solid grounding. The approach has immediate practical value for deploying LLMs on reasoning problems and raises interesting questions about how reasoning emerges in large models.
minor comments (4)
- [§3.1] The answer-extraction procedure for GSM8K (and similar math tasks) should be described more explicitly, including how cases where the model fails to produce a clearly marked final answer are handled and whether any post-processing rules were tuned on the test set (a possible extraction routine is sketched after these comments).
- [Figure 2] Figure 2 (scaling curves) would benefit from error bars on every point rather than only on selected runs; this would make the emergence threshold at large scale easier to assess visually.
- [§4.3] The paper compares CoT prompting against standard few-shot baselines but does not include a control that matches prompt length while removing the reasoning structure (e.g., repeated filler sentences). While the existing ablations make a length-only explanation unlikely, this additional control would further isolate the contribution of the reasoning format.
- [Appendix B] A brief discussion of how exemplar selection was performed (random vs. curated) and whether results are sensitive to the particular eight GSM8K exemplars would strengthen reproducibility claims.
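The extraction routine referenced in the first comment could, for instance, look like the following sketch; the actual rules used in the paper may differ, and the function below is a hypothetical stand-in rather than the authors' code.

```python
import re

def extract_final_answer(generation: str):
    """Take the last number after the final 'The answer is' marker; if the
    marker is absent, fall back to the last number anywhere in the text.
    Returns None for unparseable generations (scored as incorrect)."""
    candidate = generation.rsplit("The answer is", 1)[-1]
    numbers = re.findall(r"-?\d+(?:\.\d+)?", candidate.replace(",", ""))
    return numbers[-1] if numbers else None

assert extract_final_answer("5 + 6 = 11. The answer is 11.") == "11"
```

Documenting whether rules like these were fixed in advance or tuned against test outputs would address the reviewer's concern.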
Simulated Author's Rebuttal
We thank the referee for their thorough summary of our work, for highlighting its significance, and for recommending acceptance. No major criticisms were raised in the report; we will address the four minor comments, on answer extraction, error bars, a length-matched control, and exemplar selection, in a revised version.
Circularity Check
No significant circularity
full rationale
The paper presents purely empirical results from prompting experiments on large language models, with all accuracy metrics measured on held-out standard benchmarks such as GSM8K. No equations, derivations, or fitted parameters appear that could reduce claimed gains to quantities defined by the same inputs. The method is a straightforward prompting technique whose effects are directly observed via comparisons to few-shot baselines, and no self-citation chain or uniqueness theorem is invoked to justify the central claims. The derivation chain is therefore self-contained and consists only of experimental observations.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can learn to imitate reasoning patterns shown in a small number of in-context examples.
Forward citations
Cited by 60 Pith papers
-
Slot Machines: How LLMs Keep Track of Multiple Entities
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
-
Stability and Generalization in Looped Transformers
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...
-
GIANTS: Generative Insight Anticipation from Scientific Literature
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
-
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
-
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.
-
Generative Agents: Interactive Simulacra of Human Behavior
Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
-
Instruction Tuning with GPT-4
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
-
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
-
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
-
CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candida...
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
-
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
-
RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow
RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.
-
R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling
R2IF improves LLM function-calling accuracy by up to 34.62% on BFCL using a composite reward system with CER and SMV components optimized via GRPO, while increasing interpretability through positive CoT effectiveness.
-
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
-
Navigating the Conceptual Multiverse
The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choic...
-
RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration
RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.
-
SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
-
IE as Cache: Information Extraction Enhanced Agentic Reasoning
IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.
-
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
-
An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor
ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
-
Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
-
Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints
Banning filler words like 'very' and 'just' improved LLM reasoning by 6.7 percentage points while E-Prime improved it by only 3.7, with gains ranking in exact inverse order of theoretical depth across models and tasks.
-
Internalized Reasoning for Long-Context Visual Document Understanding
A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
-
BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations
BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
Let's Verify Step by Step
Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
-
Reflexion: Language Agents with Verbal Reinforcement Learning
Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
-
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
-
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
-
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.
-
Continuous Latent Contexts Enable Efficient Online Learning in Transformers
Transformers equipped with continuous latent context tokens can implement foundational online decision-making algorithms such as weighted majority and Q-learning, and a trained small model outperforms larger LLMs on s...
-
Let the Target Select for Itself: Data Selection via Target-Aligned Paths
Target-aligned data selection via normalized endpoint loss drop on a validation-induced reference path achieves competitive performance with reduced computational overhead.
-
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
-
BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models
BioTool dataset enables fine-tuning a 4B-parameter LLM to outperform GPT-5.1 in biomedical tool calling while improving downstream answer quality per human experts.
-
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
-
Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial
Atomic fact-checking of LLM oncology recommendations increased clinician trust from 26.9% to 66.5% (Cohen's d=0.94) in a trial of 356 doctors.
-
Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models
Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
-
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
Towards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach
A multi-agent LLM architecture with four artifact-driven roles produces ontologies from insurance contracts that have significantly better structural quality and modestly better queryability than a single-agent baseli...
-
Thinking with Reasoning Skills: Fewer Tokens, More Accuracy
Distilling and retrieving reusable reasoning skills lets LLMs solve coding and math problems with fewer tokens and higher accuracy.
-
You Don't Need Public Tests to Generate Correct Code
DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or ...
-
Job Skill Extraction via LLM-Centric Multi-Module Framework
SRICL combines semantic retrieval from ESCO, in-context learning, fine-tuning, and output verification to achieve higher STRICT-F1 scores and fewer invalid or hallucinated skill spans than GPT-3.5 baselines on six pub...
-
Structural Quality Gaps in Practitioner AI Governance Prompts: An Empirical Study Using a Five-Principle Evaluation Framework
A new five-principle framework applied to 34 practitioner AI governance prompts finds 37% lack key structural elements such as data classification and rubrics.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning
ShadowPEFT replaces distributed low-rank weight perturbations with a centralized, depth-shared shadow module that evolves parallel hidden states layer by layer, matching or beating LoRA and DoRA on generation and unde...
-
Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
-
Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs
Multimodal LLMs perceive numbers accurately across modalities but fail at multi-digit multiplication, with performance predicted by an arithmetic load metric C and degradation confirmed as computational rather than pe...
-
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
-
Structured Abductive-Deductive-Inductive Reasoning for LLMs via Algebraic Invariants
A symbolic protocol operationalizes Peirce's tripartite reasoning for LLMs using five algebraic invariants including a Weakest Link bound to enforce logical consistency and prevent weak premises from supporting strong...
-
Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation
A new SLT framework uses latent thoughts as a middle reasoning layer and plan-then-ground decoding to improve coherence and faithfulness in gloss-free sign language translation.