Pith · machine review for the scientific record

arxiv: 2201.11903 · v6 · submitted 2022-01-28 · 💻 cs.CL · cs.AI


Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou

Pith reviewed 2026-05-10 12:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords chain of thought · prompting · reasoning · large language models · GSM8K · few-shot prompting · arithmetic reasoning

The pith

Chain of thought prompting lets large language models reach state-of-the-art accuracy on math word problems using only eight examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large language models gain substantial reasoning ability when their prompts include a few examples that demonstrate a chain of intermediate steps rather than direct answers. This simple change improves results on arithmetic, commonsense, and symbolic tasks without any model training or parameter updates. The gains are especially large for the biggest models tested, where a 540 billion parameter system using eight such examples surpasses the prior best result on the GSM8K math benchmark, including systems that were fine-tuned and equipped with separate verifiers. A sympathetic reader would care because the method shows how to unlock capabilities already present in scaled models through prompt design alone.

Core claim

Generating a chain of thought, a series of intermediate reasoning steps, significantly improves the ability of large language models to perform complex reasoning. Such reasoning abilities emerge naturally in sufficiently large language models via chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in the prompt.

What carries the argument

Chain of thought prompting: the inclusion of a small number of input examples that each show a sequence of explicit reasoning steps before the final answer.
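The carrier can be made concrete as prompt construction. The sketch below contrasts a standard few-shot exemplar with a chain-of-thought exemplar, using the tennis-ball worked example that appears in the paper's figures; the helper names are illustrative, not from the paper.

```python
# Minimal sketch of the two prompt formats compared in the paper.
# Helper names and the prompt template are illustrative assumptions.

def standard_exemplar(question: str, answer: str) -> str:
    # Standard few-shot: the demonstration maps the question directly
    # to its final answer, with no intermediate steps.
    return f"Q: {question}\nA: The answer is {answer}.\n"

def cot_exemplar(question: str, chain: str, answer: str) -> str:
    # Chain of thought: explicit intermediate reasoning steps precede
    # the final answer in the demonstration.
    return f"Q: {question}\nA: {chain} The answer is {answer}.\n"

question = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?")
chain = ("Roger started with 5 balls. 2 cans of 3 tennis balls each is "
         "6 tennis balls. 5 + 6 = 11.")

# A full prompt would concatenate eight such exemplars, then the new problem.
prompt = cot_exemplar(question, chain, "11") + "Q: <new problem>\nA:"
print(prompt)
```

The only change between conditions is whether `chain` appears before the answer; the model, decoding strategy, and number of exemplars are held fixed.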

If this is right

  • A 540B model with eight chain-of-thought exemplars reaches state-of-the-art accuracy on GSM8K, beating fine-tuned GPT-3 with a verifier.
  • The same prompting method improves results across arithmetic, commonsense, and symbolic reasoning benchmarks.
  • Reasoning performance scales with model size once chain-of-thought exemplars are supplied.
  • Complex tasks become solvable without retraining or additional fine-tuning data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to problems that require much longer reasoning chains if the prompt examples are lengthened accordingly.
  • It opens a route to more interpretable model outputs by making the generated steps visible to users.
  • Standard few-shot prompting may systematically underestimate what current models can do on reasoning benchmarks.
  • Varying the structure of the reasoning steps in the examples would test whether models follow the logic or merely copy surface patterns.

Load-bearing premise

The performance gains are caused by the explicit reasoning steps rather than by simply supplying longer or more detailed prompts in general.

What would settle it

A controlled test in which prompts of matched length contain no reasoning steps but still produce the same accuracy gains on GSM8K would falsify the claim.
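The control described above can be sketched mechanically: pad an answer-only exemplar with neutral filler until it matches the chain-of-thought exemplar's length, so any remaining accuracy gap isolates the reasoning format from prompt length. The padding scheme and whitespace token count are crude stand-ins, not a procedure from the paper.

```python
# Hedged sketch of a length-matched control prompt. Whitespace splitting
# stands in for a real tokenizer; "note" is an arbitrary neutral filler token.

def pad_to_length(answer_only: str, target_tokens: int, filler: str = "note") -> str:
    # Append filler tokens until the answer-only exemplar matches the
    # chain-of-thought exemplar's token count.
    tokens = answer_only.split()
    padding = [filler] * max(0, target_tokens - len(tokens))
    return " ".join(tokens + padding)

cot = ("Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
       "5 + 6 = 11. The answer is 11.")
plain = "The answer is 11."

# Same length as the CoT exemplar, but no reasoning content.
matched = pad_to_length(plain, len(cot.split()))
```

If the length-matched condition recovered the full GSM8K gains, the load-bearing premise would fail; if it did not, the reasoning format itself carries the effect.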

Original abstract

We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper introduces chain-of-thought (CoT) prompting, in which few-shot exemplars are augmented with explicit sequences of intermediate reasoning steps. Experiments across three large language models (including PaLM-540B) demonstrate that this format yields substantial accuracy gains on arithmetic, commonsense, and symbolic reasoning benchmarks relative to standard few-shot prompting. The headline result is that eight fixed CoT exemplars enable the 540B model to reach state-of-the-art accuracy on GSM8K, surpassing a fine-tuned GPT-3 model equipped with a verifier. Gains are shown to emerge only at large scale, with supporting ablations and multiple runs on most tasks.

Significance. If the empirical results hold, the work is significant because it shows that complex reasoning capabilities can be elicited from LLMs via a simple, training-free prompting change. The consistent improvements across task families, the clear scaling threshold, the use of held-out test sets, and the reporting of multiple runs and error bars provide solid grounding. The approach has immediate practical value for deploying LLMs on reasoning problems and raises interesting questions about how reasoning emerges in large models.

minor comments (4)
  1. [§3.1] The answer-extraction procedure for GSM8K (and similar math tasks) should be described more explicitly, including how cases where the model fails to produce a boxed final answer are handled and whether any post-processing rules were tuned on the test set.
  2. [Figure 2] Figure 2 (scaling curves) would benefit from error bars on every point rather than only on selected runs; this would make the emergence threshold at large scale easier to assess visually.
  3. [§4.3] The paper compares CoT prompting against standard few-shot baselines but does not include a control that matches prompt length while removing the reasoning structure (e.g., repeated filler sentences). While the existing ablations make a length-only explanation unlikely, this additional control would further isolate the contribution of the reasoning format.
  4. [Appendix B] A brief discussion of how exemplar selection was performed (random vs. curated) and whether results are sensitive to the particular eight GSM8K exemplars would strengthen reproducibility claims.
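Minor comment 1 asks for the answer-extraction rule to be spelled out. One common convention for GSM8K-style tasks is to take the last number in the generated text as the model's final answer; the sketch below illustrates that convention, and is an assumption rather than the paper's documented post-processing.

```python
import re
from typing import Optional

def extract_answer(generation: str) -> Optional[str]:
    # Find all integer or decimal literals (allowing thousands separators)
    # and return the last one with commas stripped. Returning None makes
    # explicit the failure case the comment asks about: no number produced.
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", generation)
    if not numbers:
        return None
    return numbers[-1].replace(",", "")

print(extract_answer("5 + 6 = 11. The answer is 11."))  # prints 11
print(extract_answer("No numeric answer given."))       # prints None
```

Whether such rules were fixed in advance or tuned on the test set is exactly what the comment asks the authors to state.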

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their thorough summary of our work and for highlighting its significance. The report raises no major concerns; the four minor comments, on answer extraction, error bars, a length-matched control, and exemplar selection, would be addressed in revision.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents purely empirical results from prompting experiments on large language models, with all accuracy metrics measured on held-out standard benchmarks such as GSM8K. No equations, derivations, or fitted parameters appear that could reduce claimed gains to quantities defined by the same inputs. The method is a straightforward prompting technique whose effects are directly observed via comparisons to few-shot baselines, and no self-citation chain or uniqueness theorem is invoked to justify the central claims. The derivation chain is therefore self-contained and consists only of experimental observations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim is supported by empirical measurement rather than new theoretical axioms or fitted parameters; the work assumes only that large language models can follow in-context patterns, a standard premise in prompting research.

axioms (1)
  • domain assumption Large language models can learn to imitate reasoning patterns shown in a small number of in-context examples
    Invoked throughout the prompting experiments; this is the background assumption shared by all few-shot prompting methods.

pith-pipeline@v0.9.0 · 5447 in / 1297 out tokens · 31141 ms · 2026-05-10T12:49:15.009832+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Slot Machines: How LLMs Keep Track of Multiple Entities

    cs.CL 2026-04 unverdicted novelty 8.0

    LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

  2. Stability and Generalization in Looped Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...

  3. GIANTS: Generative Insight Anticipation from Scientific Literature

    cs.CL 2026-04 unverdicted novelty 8.0

    GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

  4. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  5. Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    cs.CL 2023-05 accept novelty 8.0

    Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

  6. Generative Agents: Interactive Simulacra of Human Behavior

    cs.HC 2023-04 accept novelty 8.0

    Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

  7. Instruction Tuning with GPT-4

    cs.CL 2023-04 unverdicted novelty 8.0

    GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

  8. Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

  9. Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

  10. CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

    cs.CL 2026-05 unverdicted novelty 7.0

    CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candida...

  11. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  12. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  13. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  14. TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

    cs.CV 2026-04 accept novelty 7.0

    TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

  15. XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

    cs.AI 2026-04 unverdicted novelty 7.0

    XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.

  16. RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow

    cs.SE 2026-04 unverdicted novelty 7.0

    RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.

  17. R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling

    cs.LG 2026-04 unverdicted novelty 7.0

    R2IF improves LLM function-calling accuracy by up to 34.62% on BFCL using a composite reward system with CER and SMV components optimized via GRPO, while increasing interpretability through positive CoT effectiveness.

  18. HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

    cs.AI 2026-04 unverdicted novelty 7.0

    HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.

  19. Navigating the Conceptual Multiverse

    cs.HC 2026-04 unverdicted novelty 7.0

    The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choic...

  20. RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

    cs.CL 2026-04 unverdicted novelty 7.0

    RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.

  21. SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

    cs.LG 2026-04 unverdicted novelty 7.0

    SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

  22. IE as Cache: Information Extraction Enhanced Agentic Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.

  23. CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

    cs.SD 2026-04 unverdicted novelty 7.0

    CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

  24. An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor

    cs.SE 2026-04 unverdicted novelty 7.0

    ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.

  25. Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.

  26. Unlocking Prompt Infilling Capability for Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

  27. Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints

    cs.CL 2026-04 accept novelty 7.0

    Banning filler words like 'very' and 'just' improved LLM reasoning by 6.7 percentage points while E-Prime improved it by only 3.7, with gains ranking in exact inverse order of theoretical depth across models and tasks.

  28. Internalized Reasoning for Long-Context Visual Document Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.

  29. BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations

    cs.NE 2026-03 unverdicted novelty 7.0

    BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.

  30. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  31. Let's Verify Step by Step

    cs.LG 2023-05 accept novelty 7.0

    Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.

  32. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  33. Reflexion: Language Agents with Verbal Reinforcement Learning

    cs.AI 2023-03 conditional novelty 7.0

    Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.

  34. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    cs.RO 2022-04 accept novelty 7.0

    SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.

  35. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  36. Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

  37. From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

    cs.AI 2026-05 unverdicted novelty 6.0

    A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.

  38. Continuous Latent Contexts Enable Efficient Online Learning in Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Transformers equipped with continuous latent context tokens can implement foundational online decision-making algorithms such as weighted majority and Q-learning, and a trained small model outperforms larger LLMs on s...

  39. Let the Target Select for Itself: Data Selection via Target-Aligned Paths

    cs.LG 2026-05 unverdicted novelty 6.0

    Target-aligned data selection via normalized endpoint loss drop on a validation-induced reference path achieves competitive performance with reduced computational overhead.

  40. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 6.0

    SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.

  41. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

  42. BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    BioTool dataset enables fine-tuning a 4B-parameter LLM to outperform GPT-5.1 in biomedical tool calling while improving downstream answer quality per human experts.

  43. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  44. Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial

    cs.CL 2026-05 conditional novelty 6.0

    Atomic fact-checking of LLM oncology recommendations increased clinician trust from 26.9% to 66.5% (Cohen's d=0.94) in a trial of 356 doctors.

  45. Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.

  46. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  47. State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...

  48. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  49. Towards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach

    cs.AI 2026-04 unverdicted novelty 6.0

    A multi-agent LLM architecture with four artifact-driven roles produces ontologies from insurance contracts that have significantly better structural quality and modestly better queryability than a single-agent baseli...

  50. Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

    cs.AI 2026-04 unverdicted novelty 6.0

    Distilling and retrieving reusable reasoning skills lets LLMs solve coding and math problems with fewer tokens and higher accuracy.

  51. You Don't Need Public Tests to Generate Correct Code

    cs.SE 2026-04 unverdicted novelty 6.0

    DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or ...

  52. Job Skill Extraction via LLM-Centric Multi-Module Framework

    cs.CL 2026-04 unverdicted novelty 6.0

    SRICL combines semantic retrieval from ESCO, in-context learning, fine-tuning, and output verification to achieve higher STRICT-F1 scores and fewer invalid or hallucinated skill spans than GPT-3.5 baselines on six pub...

  53. Structural Quality Gaps in Practitioner AI Governance Prompts: An Empirical Study Using a Five-Principle Evaluation Framework

    cs.SE 2026-04 unverdicted novelty 6.0

    A new five-principle framework applied to 34 practitioner AI governance prompts finds 37% lack key structural elements such as data classification and rubrics.

  54. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  55. ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning

    cs.CL 2026-04 unverdicted novelty 6.0

    ShadowPEFT replaces distributed low-rank weight perturbations with a centralized, depth-shared shadow module that evolves parallel hidden states layer by layer, matching or beating LoRA and DoRA on generation and unde...

  56. Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

    cs.IR 2026-04 unverdicted novelty 6.0

    CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

  57. Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs

    cs.CL 2026-04 unverdicted novelty 6.0

    Multimodal LLMs perceive numbers accurately across modalities but fail at multi-digit multiplication, with performance predicted by an arithmetic load metric C and degradation confirmed as computational rather than pe...

  58. Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition

    cs.AI 2026-04 unverdicted novelty 6.0

    Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.

  59. Structured Abductive-Deductive-Inductive Reasoning for LLMs via Algebraic Invariants

    cs.AI 2026-04 unverdicted novelty 6.0

    A symbolic protocol operationalizes Peirce's tripartite reasoning for LLMs using five algebraic invariants including a Weakest Link bound to enforce logical consistency and prevent weak premises from supporting strong...

  60. Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation

    cs.CV 2026-04 unverdicted novelty 6.0

    A new SLT framework uses latent thoughts as a middle reasoning layer and plan-then-ground decoding to improve coherence and faithfulness in gloss-free sign language translation.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 125 Pith papers · 12 internal anchors

  1. [1]

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. 2022. https://arxiv.org/abs/2204.01691 Do as I can, not as I say: Grounding language in robotic affordances . arXiv preprint arXiv:2204.01691

  2. [2]

    Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. https://aclanthology.org/N19-1245 M ath QA : Towards interpretable math word problem solving with operation-based formalisms . In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Hu...

  3. [3]

    Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. https://doi.org/10.18653/v1/D19-1609 Giving BERT a calculator: Finding operations and arguments with reading comprehension . EMNLP

  4. [4]

    Jacob Andreas, Dan Klein, and Sergey Levine. 2018. https://aclanthology.org/N18-1197 Learning with latent language . NAACL

  5. [5]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. https://arxiv.org/abs/2108.07732 Program synthesis with large language models . arXiv preprint arXiv:2108.07732

  6. [6]

    BIG-bench collaboration . 2021. https://github.com/google/BIG-bench/ Beyond the imitation game: Measuring and extrapolating the capabilities of language models . In preparation

  7. [7]

    Kaj Bostrom, Xinyu Zhao, Swarat Chaudhuri, and Greg Durrett. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.506 Flexible generation of natural language deductions . EMNLP

  8. [8]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  9. [9]

    Jonathon Cai, Richard Shin, and Dawn Song. 2017. https://arxiv.org/abs/1704.06611 Making neural programming architectures generalize via recursion . ICLR

  10. [10]

    Oana-Maria Camburu, Tim Rockt \"a schel, Thomas Lukasiewicz, and Phil Blunsom. 2018. https://arxiv.org/pdf/1812.01193.pdf e- SNLI : Natural language inference with natural language explanations . NeurIPS

  11. [11]

    Howard Chen, Jacqueline He, Karthik Narasimhan, and Danqi Chen. 2022. https://arxiv.org/abs/2204.11790 Can rationalization improve robustness? NAACL

  12. [12]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. https://arxiv.org/abs/2107.03374 Evaluating large language models trained on code . arXiv preprint arXiv:2107.03374

  13. [13]

    Xinyun Chen, Chen Liang, Adams Wei Yu, Denny Zhou, Dawn Song, and Quoc V. Le. 2019. https://openreview.net/forum?id=ryxjnREFwH Neural symbolic reader: Scalable integration of distributed and symbolic representations for reading comprehension . ICLR

  14. [14]

    Ting-Rui Chiang and Yun-Nung Chen. 2019. https://doi.org/10.18653/v1/N19-1272 Semantically-aligned equation generation for solving and reasoning math word problems . In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 26...

  15. [15]

    Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. https://www.ijcai.org/proceedings/2020/0537.pdf Transformers as soft reasoners over language . IJCAI

  16. [16]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://arxiv.org/abs/2110.14168 Training verifiers to solve math word problems . arXiv preprint arXiv:2110.14168

  17. [17]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://aclanthology.org/N19-1423 BERT : Pre-training of deep bidirectional transformers for language understanding . NAACL

  18. [18]

    Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou. 2019. https://arxiv.org/abs/1904.11694 Neural logic machines . ICLR

  19. [19]

    Dheeru Dua, Sameer Singh, and Matt Gardner. 2020. https://aclanthology.org/2020.acl-main.497 Benefits of intermediate annotations in reading comprehension . ACL

  20. [20]

    Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. https://doi.org/10.1162/tacl_a_00370 Did aristotle use a laptop? A question answering benchmark with implicit reasoning strategies . TACL

  21. [21]

    Yuling Gu, Bhavana Dalvi Mishra, and Peter Clark. 2022. https://arxiv.org/pdf/2112.08656.pdf DREAM : Uncovering mental models behind language models . NAACL

  22. [22]

    Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, and Christopher Ré. 2018. https://doi.org/10.18653/v1/P18-1175 Training classifiers with natural language explanations . ACL

  23. [23]

    Peter Hase and Mohit Bansal. 2022. https://arxiv.org/abs/2102.02201 When can models learn from explanations? a formal framework for understanding the roles of explanation data . ACL

  24. [24]

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. https://arxiv.org/abs/2103.03874 Measuring mathematical problem solving with the math dataset . arXiv preprint arXiv:2103.03874

  25. [25]

    Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. https://doi.org/10.3115/v1/D14-1058 Learning to solve arithmetic word problems with verb categorization . EMNLP

  26. [26]

    Zhanming Jie, Jierui Li, and Wei Lu. 2022. https://arxiv.org/abs/2203.10316 Learning to reason deductively: Math word problem solving as complex relation extraction . arXiv preprint arXiv:2203.10316

  27. [27]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. https://arxiv.org/abs/2001.08361 Scaling laws for neural language models . arXiv preprint arXiv:2001.08361

  28. [28]

    Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. https://doi.org/10.18653/v1/N16-1136 MAWPS : A math word problem repository . NAACL

  29. [29]

    Andrew K. Lampinen, Ishita Dasgupta, Stephanie C.Y. Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L. McClelland, Jane X. Wang, and Felix Hill. 2022. https://arxiv.org/abs/2204.02329 Can language models learn from explanations in context? arXiv preprint arXiv:2204.02329

  30. [30]

    Yihuai Lan, Lei Wang, Qiyuan Zhang, Yunshi Lan, Bing Tian Dai, Yan Wang, Dongxiang Zhang, and Ee-Peng Lim. 2021. https://arxiv.org/abs/2109.00799 MWPToolkit: An open-source framework for deep learning-based math word problem solvers . arXiv preprint arXiv:2109.00799

  31. [31]

    Teven Le Scao and Alexander Rush. 2021. https://doi.org/10.18653/v1/2021.naacl-main.208 How many data points is a prompt worth? NAACL

  32. [32]

    Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.243 The power of scale for parameter-efficient prompt tuning . EMNLP

  33. [33]

    Iddo Lev, Bill MacCartney, Christopher Manning, and Roger Levy. 2004. https://aclanthology.org/W04-0902 Solving logic puzzles: From robust processing to precise semantics . Proceedings of the 2nd Workshop on Text Meaning and Interpretation

  34. [34]

    Xiang Lisa Li and Percy Liang. 2021. https://doi.org/10.18653/v1/2021.acl-long.353 Prefix-tuning: Optimizing continuous prompts for generation . ACL

  35. [35]

    Zhengzhong Liang, Steven Bethard, and Mihai Surdeanu. 2021. https://doi.org/10.18653/v1/2021.naacl-main.97 Explainable multi-hop verbal reasoning through internal monologue . NAACL

  36. [36]

    Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. https://doi.org/10.18653/v1/P17-1015 Program induction by rationale generation: Learning to solve and explain algebraic word problems . ACL

  37. [37]

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. https://arxiv.org/abs/2107.13586 Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing . arXiv preprint arXiv:2107.13586

  38. [38]

    Bodhisattwa Prasad Majumder, Oana-Maria Camburu, Thomas Lukasiewicz, and Julian McAuley. 2021. https://arxiv.org/abs/2106.13876 Rationale-inspired natural language explanations with commonsense . arXiv preprint arXiv:2106.13876

  39. [39]

    Ana Marasović, Iz Beltagy, Doug Downey, and Matthew E Peters. 2022. http://arxiv.org/abs/2111.08284 Few-shot self-rationalization with natural language prompts . NAACL Findings

  40. [40]

    Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. https://doi.org/10.18653/v1/2020.acl-main.173 On faithfulness and factuality in abstractive summarization . ACL

  41. [41]

    Shen Yun Miao, Chao Chun Liang, and Keh Yih Su. 2020. https://doi.org/10.18653/v1/2020.acl-main.92 A diverse corpus for evaluating and developing English math word problem solvers . ACL

  42. [42]

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. https://arxiv.org/abs/2202.12837 Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837

  43. [43]

    Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. 2020. https://arxiv.org/abs/2004.14546 WT5?! Training text-to-text models to explain their predictions . arXiv preprint arXiv:2004.14546

  44. [44]

    Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. http://arxiv.org/abs/2112.00114 Show your work: Scratchpads for intermediate computation with language models . arXiv preprint arXiv:2112.00114

  45. [45]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. https://arxiv.org/abs/2203.02155 Training language models to follow instructions with human feedback . arXiv preprint arXiv:2203.02155

  46. [46]

    Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. https://aclanthology.org/2021.naacl-main.168.pdf Are NLP models really able to solve simple math word problems? NAACL

  47. [47]

    Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. https://aclanthology.org/N18-1202 Deep contextualized word representations . NAACL

  48. [48]

    Xinyu Pi, Qian Liu, Bei Chen, Morteza Ziyadi, Zeqi Lin, Yan Gao, Qiang Fu, Jian-Guang Lou, and Weizhu Chen. 2022. https://arxiv.org/abs/2201.11473 Reasoning like program executors . arXiv preprint arXiv:2201.11473

  49. [49]

    Piotr Piękos, Mateusz Malinowski, and Henryk Michalewski. 2021. https://doi.org/10.18653/v1/2021.acl-short.49 Measuring and improving BERT's mathematical abilities by predicting the order of reasoning . ACL

  50. [50]

    Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. https://arxiv.org/abs/2112.11446 Scaling language models: Methods, analysis & insights from training Gopher . arXiv preprint arXiv:2112.11446

  51. [51]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. https://arxiv.org/abs/1910.10683 Exploring the limits of transfer learning with a unified text-to-text transformer . Journal of Machine Learning Research, 21:1--67

  52. [52]

    Dheeraj Rajagopal, Vidhisha Balachandran, Eduard H. Hovy, and Yulia Tsvetkov. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.64 SelfExplain : A self-explaining architecture for neural text classifiers . EMNLP

  53. [53]

    Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. https://doi.org/10.18653/v1/P19-1487 Explain yourself! Leveraging language models for commonsense reasoning . ACL

  54. [54]

    Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. https://doi.org/10.18653/v1/D19-1251 NumNet: Machine reading comprehension with numerical reasoning . EMNLP

  55. [55]

    Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. 2021. https://arxiv.org/abs/2112.12870 Measuring attribution in natural language generation models . arXiv preprint arXiv:2112.12870

  56. [56]

    Gabriel Recchia. 2021. https://arxiv.org/abs/2109.02102 Teaching autoregressive language models complex tasks by demonstration . arXiv preprint arXiv:2109.02102

  57. [57]

    Emily Reif, Daphne Ippolito, Ann Yuan, Andy Coenen, Chris Callison-Burch, and Jason Wei. 2022. https://arxiv.org/abs/2109.03910 A recipe for arbitrary text style transfer with large language models . ACL

  58. [58]

    Laria Reynolds and Kyle McDonell. 2021. https://arxiv.org/abs/2102.07350 Prompt programming for large language models: Beyond the few-shot paradigm . Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems

  59. [59]

    Subhro Roy and Dan Roth. 2015. https://doi.org/10.18653/v1/D15-1202 Solving general arithmetic word problems . EMNLP

  60. [60]

    Subhro Roy, Tim Vieira, and Dan Roth. 2015. https://doi.org/10.1162/tacl_a_00118 Reasoning about Quantities in Natural Language . TACL

  61. [61]

    Mohammed Saeed, Naser Ahmadi, Preslav Nakov, and Paolo Papotti. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.110 RuleBERT : Teaching soft rules to pre-trained language models . EMNLP

  62. [62]

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. https://arxiv.org/abs/2110.08207 Multitask prompted training enables zero-shot task generalization . ICLR

  63. [63]

    Jianhao Shen, Yichun Yin, Lin Li, Lifeng Shang, Xin Jiang, Ming Zhang, and Qun Liu. 2021. https://aclanthology.org/2021.findings-emnlp.195 Generate & rank: A multi-task framework for math word problems . In Findings of the Association for Computational Linguistics: EMNLP 2021

  64. [64]

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. https://doi.org/10.18653/v1/N19-1421 CommonsenseQA : A question answering challenge targeting commonsense knowledge . NAACL

  65. [65]

    Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. 2020. https://arxiv.org/abs/2006.06609 Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge . NeurIPS

  66. [66]

    Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan Berant. 2021. https://arxiv.org/abs/2201.05320 CommonsenseQA 2.0: Exposing the limits of AI through gamification . NeurIPS Track on Datasets and Benchmarks

  67. [67]

    Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, and Donald Metzler. 2022. https://arxiv.org/abs/2205.05131 Unifying language learning paradigms . arXiv preprint arXiv:2205.05131

  68. [68]

    Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. https://arxiv.org/abs/2201.08239 LaMDA : Language models for dialog applications . arXiv preprint arXiv:2201.08239

  69. [69]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022a. https://arxiv.org/abs/2203.11171 Self-consistency improves chain of thought reasoning in language models . arXiv preprint arXiv:2203.11171

  70. [70]

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022b. https://arxiv.org/abs/2204.07705 Benchmarking generalization via in-context instructions on 1,600+ language tasks . arXiv preprint arXiv:2204.07705

  71. [71]

    Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. https://openreview.net/forum?id=gEZrGCozdqR Finetuned language models are zero-shot learners . ICLR

  72. [72]

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022b. https://openreview.net/forum?id=yzkSU5zdwD Emergent abilities of large language models . Transactions on Machine Learning Research

  73. [73]

    Sarah Wiegreffe, Jack Hessel, Swabha Swayamdipta, Mark Riedl, and Yejin Choi. 2022. https://arxiv.org/abs/2112.08674 Reframing human-AI collaboration for generating free-text explanations . NAACL

  74. [74]

    Sarah Wiegreffe and Ana Marasović. 2021. https://arxiv.org/abs/2102.12060 Teach me to explain: A review of datasets for explainable NLP . NeurIPS

  75. [75]

    Sarah Wiegreffe, Ana Marasović, and Noah A. Smith. 2021. https://aclanthology.org/2021.emnlp-main.804 Measuring association between labels and free-text rationales . EMNLP

  76. [76]

    Tongshuang Wu, Ellen Jiang, Aaron Donsbach, Jeff Gray, Alejandra Molina, Michael Terry, and Carrie J Cai. 2022a. https://dl.acm.org/doi/abs/10.1145/3491101.3519729 PromptChainer: Chaining large language model prompts through visual programming . CHI Extended Abstracts

  77. [77]

    Tongshuang Wu, Michael Terry, and Carrie Jun Cai. 2022b. https://dl.acm.org/doi/abs/10.1145/3491102.3517582 AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts . CHI

  78. [78]

    Yujun Yan, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, and Milad Hashemi. 2020. https://arxiv.org/abs/2006.08084 Neural execution engines: Learning to execute subroutines . NeurIPS

  79. [79]

    Huihan Yao, Ying Chen, Qinyuan Ye, Xisen Jin, and Xiang Ren. 2021. https://proceedings.neurips.cc/paper/2021/hash/4b26dc4663ccf960c8538d595d0a1d3a-Abstract.html Refining language models with compositional explanations . NeurIPS

  80. [80]

    Xi Ye and Greg Durrett. 2022. https://arxiv.org/abs/2205.03401 The unreliability of explanations in few-shot in-context learning . arXiv preprint arXiv:2205.03401

Showing first 80 references.