hub Mixed citations

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber · 2021 · cs.LG · arXiv 2112.00114

Mixed citation behavior. Most common role is background (55%).

54 Pith papers citing it

Background 55% of classified citations

open full Pith review browse 54 citing papers arXiv PDF

abstract

Large pre-trained language models perform remarkably well on tasks that can be done "in one pass", such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. Surprisingly, we find that these same models are able to perform complex multi-step computations -- even in the few-shot regime -- when asked to perform the operation "step by step", showing the results of intermediate computations. In particular, we train transformers to perform multi-step computations by asking them to emit intermediate computation steps into a "scratchpad". On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 dataset 1 method 1 other 1

citation-polarity summary

background 6 support 2 unclear 1 use dataset 1 use method 1

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

cs.LG · 2026-03-30 · unverdicted · novelty 8.0

Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

PAL: Program-aided Language Models

cs.CL · 2022-11-18 · conditional · novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.

Equilibrium Residuals Expose Three Regimes of Matrix-Game Strategic Reasoning in Language Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

LLMs rely on semantic cues for matrix-game equilibria but can acquire approximate computation via residual training on small instances, with a Lipschitz proof enabling transfer to larger anonymous games.

A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Online mistake bounds for autoregressive output learning can grow logarithmically with generation horizon M under end-to-end feedback but become independent of M with chain-of-thought trajectory access.

Training Transformers as a Universal Computer

cs.AI · 2026-04-28 · unverdicted · novelty 7.0

A transformer trained on random meaningless MicroPy programs generalizes to execute diverse human-written programs, providing empirical evidence it can act as a universal computer.

WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking

cs.AI · 2026-03-28 · unverdicted · novelty 7.0

WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

cs.AI · 2025-03-14 · conditional · novelty 7.0

Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

Let's Verify Step by Step

cs.LG · 2023-05-31 · accept · novelty 7.0

Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

cs.CL · 2022-11-22 · unverdicted · novelty 7.0

PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

TCARD: Nearly Balanced Two-Level Designs with Treatment Cardinality Constraints with an Application to LLM Prompt Engineering

stat.ME · 2026-05-20 · unverdicted · novelty 6.0

Proposes nearly balanced TCARDs that minimize the first two generalized word-length pattern components, defines Φ_BCD criterion linked to classical optimality, and constructs designs via coordinate exchange with simulation-calibrated weights for LLM prompt engineering.

Strategy-Induct: Task-Level Strategy Induction for Instruction Generation

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

Strategy-Induct induces task-level instructions from question-only examples by generating reasoning strategies first, then using those pairs to create a guiding instruction.

One Model, Two Roles: Emergent Specialization in a Shared Recurrent Transformer

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

In a minimal two-state recurrent Transformer, asymmetric input injection induces stable specialization where one state becomes a committed proposal and the other retains shifting uncertainty.

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.

K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

K-MetBench shows LLMs have large gaps in interpreting meteorology diagrams and Korean-specific context, with smaller local models beating much larger global ones.

CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference

cs.SE · 2026-04-14 · unverdicted · novelty 6.0

CoDe-R refines LLM decompiler output via rationale-guided semantic injection and dynamic fallback inference, making a 1.3B model the first to exceed 50% average re-executability on HumanEval-Decompile.

Deep sequence models tend to memorize geometrically; it is unclear why

cs.LG · 2025-10-30 · unverdicted · novelty 6.0

Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.

citing papers explorer

Showing 14 of 14 citing papers after filters.

Training Transformers as a Universal Computer cs.AI · 2026-04-28 · unverdicted · none · ref 15 · internal anchor
A transformer trained on random meaningless MicroPy programs generalizes to execute diverse human-written programs, providing empirical evidence it can act as a universal computer.
WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking cs.AI · 2026-03-28 · unverdicted · none · ref 47 · internal anchor
WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation cs.AI · 2025-03-14 · conditional · none · ref 41 · internal anchor
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel cs.AI · 2026-05-12 · unverdicted · none · ref 38 · internal anchor
CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA cs.AI · 2026-05-05 · unverdicted · none · ref 58 · internal anchor
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.
ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems cs.AI · 2025-10-07 · unverdicted · none · ref 16 · internal anchor
ARM evolves specialized reasoning modules from basic CoT via tree search to serve as reusable components in multi-agent systems that generalize across models and domains without per-task re-optimization.
Towards an AI co-scientist cs.AI · 2025-02-26 · unverdicted · none · ref 149 · internal anchor
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 172 · internal anchor
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
Cognitive Architectures for Language Agents cs.AI · 2023-09-05 · accept · none · ref 56 · internal anchor
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems cs.AI · 2026-05-21 · unverdicted · none · ref 115 · internal anchor
Converts impossibility theorems into architecture-dependent accuracy ceilings and design rules for transformers and other AI subfields, with the Deterministic Horizon measured at 19-31 across twelve models.
Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems cs.AI · 2026-05-12 · unverdicted · none · ref 7 · internal anchor
A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work cs.AI · 2026-05-07 · conditional · none · ref 5 · internal anchor
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
SPREG: Structured Plan Repair with Entropy-Guided Test-Time Intervention for Large Language Model Reasoning cs.AI · 2026-04-20 · unverdicted · none · ref 33 · internal anchor
SPREG detects logical failures in LLM long-chain reasoning through real-time entropy spikes and performs structured plan repairs using historical distributions, reporting a 20% absolute accuracy gain on AIME25.
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications cs.AI · 2024-02-05 · unverdicted · none · ref 12 · internal anchor
A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a taxonomy and summary table.

Show Your Work: Scratchpads for Intermediate Computation with Language Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer