Show Your Work: Scratchpads for Intermediate Computation with Language Models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber · 2021 · cs.LG · arXiv 2112.00114

27 Pith papers cite this work. Polarity classification is still indexing.

abstract

Large pre-trained language models perform remarkably well on tasks that can be done "in one pass", such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. Surprisingly, we find that these same models are able to perform complex multi-step computations -- even in the few-shot regime -- when asked to perform the operation "step by step", showing the results of intermediate computations. In particular, we train transformers to perform multi-step computations by asking them to emit intermediate computation steps into a "scratchpad". On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations.
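
To make the scratchpad idea concrete, here is a minimal Python sketch that writes out the intermediate steps of long addition before the final answer. The function name `addition_scratchpad` and the trace layout are assumptions made for illustration; they are not the paper's exact scratchpad format, and the sketch only builds a target string of the kind a model could be trained or prompted to emit. It does not call any model.

```python
def addition_scratchpad(a: int, b: int) -> str:
    """Build a digit-by-digit trace of a + b, ending with the answer.

    Hypothetical format for illustration only; assumes a, b >= 0.
    """
    # Reverse the digit strings so index 0 is the least significant digit.
    xs, ys = str(a)[::-1], str(b)[::-1]
    lines = [f"Input: {a} + {b}", "Scratchpad:"]
    carry = 0
    digits = []
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry
        digit, new_carry = total % 10, total // 10
        # Each line records one intermediate step, including the carry state.
        lines.append(
            f"{x} + {y} + carry {carry} = {total} -> digit {digit}, carry {new_carry}"
        )
        digits.append(str(digit))
        carry = new_carry
    if carry:
        digits.append(str(carry))
    answer = "".join(reversed(digits))
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(addition_scratchpad(29, 57))
```

The point of the layout is that every step depends only on the previous line's carry and two input digits, so the final answer is read off the trace rather than produced in one pass.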

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

cs.LG · 2026-03-30 · unverdicted · novelty 8.0

Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.

Equilibrium Residuals Expose Three Regimes of Matrix-Game Strategic Reasoning in Language Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

LLMs rely on semantic cues for matrix-game equilibria but can acquire approximate computation via residual training on small instances, with a Lipschitz proof enabling transfer to larger anonymous games.

A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Online mistake bounds for autoregressive output learning can grow logarithmically with generation horizon M under end-to-end feedback but become independent of M with chain-of-thought trajectory access.

Training Transformers as a Universal Computer

cs.AI · 2026-04-28 · unverdicted · novelty 7.0

A transformer trained on random meaningless MicroPy programs generalizes to execute diverse human-written programs, providing empirical evidence it can act as a universal computer.

Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

cs.CL · 2026-04-19 · unverdicted · novelty 7.0

CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

Let's Verify Step by Step

cs.LG · 2023-05-31 · accept · novelty 7.0

Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

cs.CL · 2022-11-22 · unverdicted · novelty 7.0

PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.

K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

K-MetBench shows LLMs have large gaps in interpreting meteorology diagrams and Korean-specific context, with smaller local models beating much larger global ones.

CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference

cs.SE · 2026-04-14 · unverdicted · novelty 6.0

CoDe-R refines LLM decompiler output via rationale-guided semantic injection and dynamic fallback inference, making a 1.3B model the first to exceed 50% average re-executability on HumanEval-Decompile.

Towards an AI co-scientist

cs.AI · 2025-02-26 · unverdicted · novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

Improving Factuality and Reasoning in Language Models through Multiagent Debate

cs.CL · 2023-05-23 · unverdicted · novelty 6.0

Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.

Teaching Large Language Models to Self-Debug

cs.CL · 2023-04-11 · unverdicted · novelty 6.0

Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

cs.CL · 2022-10-17 · accept · novelty 6.0

Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.

Inner Monologue: Embodied Reasoning through Planning with Language Models

cs.RO · 2022-07-12 · unverdicted · novelty 6.0

LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Emergent Abilities of Large Language Models

cs.CL · 2022-06-15 · unverdicted · novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

PaLM: Scaling Language Modeling with Pathways

cs.CL · 2022-04-05 · accept · novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems

cs.AI · 2026-05-12 · unverdicted · novelty 5.0

A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.

From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

cs.AI · 2026-05-07 · conditional · novelty 5.0

Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

citing papers explorer

Showing 27 of 27 citing papers.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
cs.CL · 2022-01-28 · accept · none · ref 44 · internal anchor
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication
cs.LG · 2026-03-30 · unverdicted · none · ref 26 · internal anchor
Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.

Equilibrium Residuals Expose Three Regimes of Matrix-Game Strategic Reasoning in Language Models
cs.LG · 2026-05-11 · unverdicted · none · ref 18 · internal anchor
LLMs rely on semantic cues for matrix-game equilibria but can acquire approximate computation via residual training on small instances, with a Lipschitz proof enabling transfer to larger anonymous games.

A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning
cs.LG · 2026-05-07 · unverdicted · none · ref 18 · internal anchor
Online mistake bounds for autoregressive output learning can grow logarithmically with generation horizon M under end-to-end feedback but become independent of M with chain-of-thought trajectory access.

Training Transformers as a Universal Computer
cs.AI · 2026-04-28 · unverdicted · none · ref 15 · internal anchor
A transformer trained on random meaningless MicroPy programs generalizes to execute diverse human-written programs, providing empirical evidence it can act as a universal computer.

Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
cs.CL · 2026-04-19 · unverdicted · none · ref 51 · internal anchor
CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG · 2024-01-19 · conditional · none · ref 95 · internal anchor
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

Let's Verify Step by Step
cs.LG · 2023-05-31 · accept · none · ref 13 · internal anchor
Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
cs.CL · 2022-11-22 · unverdicted · none · ref 20 · internal anchor
PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
cs.AI · 2026-05-12 · unverdicted · none · ref 38 · internal anchor
CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
cs.AI · 2026-05-05 · unverdicted · none · ref 58 · internal anchor
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.

K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology
cs.CL · 2026-04-27 · unverdicted · none · ref 4 · internal anchor
K-MetBench shows LLMs have large gaps in interpreting meteorology diagrams and Korean-specific context, with smaller local models beating much larger global ones.

CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference
cs.SE · 2026-04-14 · unverdicted · none · ref 22 · internal anchor
CoDe-R refines LLM decompiler output via rationale-guided semantic injection and dynamic fallback inference, making a 1.3B model the first to exceed 50% average re-executability on HumanEval-Decompile.

Towards an AI co-scientist
cs.AI · 2025-02-26 · unverdicted · none · ref 149 · internal anchor
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE · 2024-03-12 · unverdicted · none · ref 288 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

Improving Factuality and Reasoning in Language Models through Multiagent Debate
cs.CL · 2023-05-23 · unverdicted · none · ref 20 · internal anchor
Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.

Teaching Large Language Models to Self-Debug
cs.CL · 2023-04-11 · unverdicted · none · ref 114 · internal anchor
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
cs.CL · 2022-10-17 · accept · none · ref 19 · internal anchor
Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.

Inner Monologue: Embodied Reasoning through Planning with Language Models
cs.RO · 2022-07-12 · unverdicted · none · ref 13 · internal anchor
LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.

Language Models (Mostly) Know What They Know
cs.CL · 2022-07-11 · unverdicted · none · ref 203 · internal anchor
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Emergent Abilities of Large Language Models
cs.CL · 2022-06-15 · unverdicted · none · ref 60 · internal anchor
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

PaLM: Scaling Language Modeling with Pathways
cs.CL · 2022-04-05 · accept · none · ref 105 · internal anchor
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems
cs.AI · 2026-05-12 · unverdicted · none · ref 7 · internal anchor
A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.

From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
cs.AI · 2026-05-07 · conditional · none · ref 5 · internal anchor
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

PaLM 2 Technical Report
cs.CL · 2023-05-17 · unverdicted · none · ref 235 · internal anchor
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

SPREG: Structured Plan Repair with Entropy-Guided Test-Time Intervention for Large Language Model Reasoning
cs.AI · 2026-04-20 · unverdicted · none · ref 33 · internal anchor
SPREG detects logical failures in LLM long-chain reasoning through real-time entropy spikes and performs structured plan repairs using historical distributions, reporting a 20% absolute accuracy gain on AIME25.

A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
cs.AI · 2024-02-05 · unverdicted · none · ref 12 · internal anchor
A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a taxonomy and summary table.
