Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-12 04:37 UTC · model grok-4.3
The pith
Repeatedly sampling from language models scales the fraction of problems solved over four orders of magnitude.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Coverage, the fraction of problems solved by at least one generated sample, scales with the number of samples drawn from the model over four orders of magnitude, across tasks and models. The relationship is often log-linear and well modeled by an exponentiated power law. In domains equipped with automatic verification, the increase in coverage translates directly into performance: repeated sampling lifts accuracy on SWE-bench Lite from 15.9 percent with one sample to 56 percent with 250 samples from a single model, above the 43 percent single-sample state of the art.
What carries the argument
Coverage: the fraction of problems for which at least one of the repeatedly sampled model outputs is correct.
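This metric is typically estimated from a finite sample pool with the standard unbiased pass@k estimator (Chen et al., 2021); a minimal sketch, with illustrative function names:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples
    is correct, given c correct answers among n generated samples."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def coverage(per_problem: list[tuple[int, int]], k: int) -> float:
    """Coverage at budget k: mean pass@k over per-problem (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)

# One problem solved only by sampling (1/10 correct) and one unsolved problem.
print(coverage([(10, 1), (10, 0)], k=5))  # → 0.25
```

The estimator averages over all size-k subsets of the n samples rather than drawing k fresh samples, which is what makes it unbiased at any k ≤ n.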
If this is right
- In coding and proof tasks, performance rises in step with the number of samples when automatic verification is available.
- Inference compute can be traded for higher success rates without changing the underlying model.
- Majority voting and reward-model selection reach a plateau after a few hundred samples and do not keep scaling.
- The log-linear pattern suggests inference-time scaling laws may exist alongside training-time scaling laws.
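The contrast between verified selection and majority voting can be illustrated with a toy closed-form model. The per-problem success probabilities below are hypothetical, and majority voting is idealized as "correct iff more than half of the samples are correct":

```python
from math import comb

# Hypothetical single-sample success probabilities; the low-probability
# problems are the ones repeated sampling eventually cracks.
PROBS = [0.6, 0.3, 0.05, 0.01, 0.002]

def coverage(k: int) -> float:
    """P(at least one of k independent samples is correct), averaged."""
    return sum(1 - (1 - p) ** k for p in PROBS) / len(PROBS)

def majority(k: int) -> float:
    """P(more than half of k samples are correct), averaged (idealized)."""
    def tail(p: float) -> float:
        return sum(comb(k, i) * p**i * (1 - p) ** (k - i)
                   for i in range(k // 2 + 1, k + 1))
    return sum(tail(p) for p in PROBS) / len(PROBS)

for k in (1, 10, 100, 1000):
    print(f"k={k:5d}  coverage={coverage(k):.3f}  majority={majority(k):.3f}")
```

In this toy model coverage keeps climbing toward 1 as k grows, while majority voting plateaus near the fraction of problems whose single-sample success probability exceeds one half (here 1/5), mirroring the plateau the paper reports.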
Where Pith is reading between the lines
- If the same scaling holds at much larger sample budgets, difficult problems could be solved by allocating more inference compute in the same way larger models are trained.
- Tasks lacking verifiers would benefit from new selection methods that continue to improve beyond the plateau observed with voting.
- The pattern implies many failures are sampling variance rather than absolute model limits, opening a route to diagnose capability gaps by exhaustive sampling.
Load-bearing premise
That an automatic verifier can identify the correct samples in the collection, and that the model does not collapse into repetitive outputs that erase the sample diversity coverage depends on.
What would settle it
Measuring whether coverage continues to rise or flattens after generating several thousand samples per problem on a fixed set of tasks.
original abstract
Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model. Across multiple tasks and models, we observe that coverage -- the fraction of problems that are solved by any generated sample -- scales with the number of samples over four orders of magnitude. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. In domains like coding and formal proofs, where answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-Coder-V2-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-sample state-of-the-art of 43%. In domains without automatic verifiers, we find that common methods for picking from a sample collection (majority voting and reward models) plateau beyond several hundred samples and fail to fully scale with the sample budget.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper explores scaling inference compute for LLMs via repeated sampling of candidate solutions rather than single-shot generation. It empirically demonstrates that coverage—the fraction of problems solved by at least one sample—scales with sample count over four orders of magnitude, often following a log-linear trend that can be modeled by an exponentiated power law. In domains with automatic verifiers (coding, formal proofs), this directly improves performance; e.g., on SWE-bench Lite, DeepSeek-Coder-V2-Instruct rises from 15.9% (1 sample) to 56% (250 samples), exceeding the prior single-sample SOTA of 43%. In non-verifiable domains, majority voting and reward models plateau after a few hundred samples.
Significance. If the scaling relationship and its translation to performance hold under scrutiny, the work provides concrete evidence for inference-time scaling laws, analogous to training compute scaling. This could shift practice toward allocating more inference budget to sampling in verifiable settings, with immediate gains on benchmarks like SWE-bench. The empirical breadth across tasks and models, plus the outperformance result, makes the finding potentially impactful for both theory and deployment.
major comments (2)
- [Experiments (coverage plots and power-law fits)] The headline coverage scaling claim (log-linear over four orders of magnitude, fit by exponentiated power law) is load-bearing for the inference-time scaling law conclusion, yet the manuscript reports no metrics on sample uniqueness, entropy, or duplicate rates at large n (e.g., beyond a few hundred). If the model distribution has finite support and begins repeating solutions, the measured coverage curve would saturate and the power-law fit would not reflect genuine scaling; this directly affects the weakest assumption noted in the stress test.
- [SWE-bench Lite evaluation] The SWE-bench Lite result (15.9% → 56% at 250 samples) relies entirely on automatic verification to label samples as correct. No analysis of verifier false-positive rate, inter-sample consistency, or sensitivity to verifier errors is provided; any systematic mislabeling would inflate both coverage and the reported performance gain, undermining the claim that repeated sampling outperforms the single-sample SOTA.
minor comments (2)
- [Modeling section] The power-law fitting procedure (how the exponent is estimated, whether fits are per-task or aggregated, and goodness-of-fit statistics) is described only at a high level; adding the exact regression details and per-task R² values would improve reproducibility.
- [Figures 1–3] Coverage curves in the main figures would benefit from error bands (e.g., across random seeds or problem subsets) to indicate variability, especially at the largest sample counts where repetition risk is highest.
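The diversity metrics requested in the first major comment are cheap to compute. A minimal sketch over raw output strings (in practice one would canonicalize solutions, e.g. strip whitespace or normalize ASTs, before counting):

```python
from collections import Counter
from math import log

def diversity_stats(samples: list[str]) -> dict[str, float]:
    """Duplicate rate and Shannon entropy (nats) of the empirical
    distribution over sampled outputs for a single problem."""
    counts = Counter(samples)
    n = len(samples)
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return {
        "unique_frac": len(counts) / n,  # fraction of distinct outputs
        "dup_rate": 1 - len(counts) / n,
        "entropy": entropy,
    }

# A saturating sampler keeps emitting the same string: entropy tends to 0.
print(diversity_stats(["x + 1", "x + 1", "x + 1", "x - 1"]))
```

Tracking `unique_frac` as a function of n is the direct check on whether apparent coverage gains come from genuinely new samples.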
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. The two major comments raise important points about the robustness of our empirical claims. We address each below and indicate where we will revise the manuscript to incorporate additional analysis.
point-by-point responses
-
Referee: [Experiments (coverage plots and power-law fits)] The headline coverage scaling claim (log-linear over four orders of magnitude, fit by exponentiated power law) is load-bearing for the inference-time scaling law conclusion, yet the manuscript reports no metrics on sample uniqueness, entropy, or duplicate rates at large n (e.g., beyond a few hundred). If the model distribution has finite support and begins repeating solutions, the measured coverage curve would saturate and the power-law fit would not reflect genuine scaling; this directly affects the weakest assumption noted in the stress test.
Authors: We agree that quantifying sample diversity is necessary to confirm that the observed coverage scaling is not an artifact of repetition. In the revised manuscript we will add a new subsection reporting (i) the fraction of unique solutions as a function of n, (ii) the entropy of the empirical distribution over solutions, and (iii) the rate at which new unique solutions appear beyond n = 100. Our internal checks on the coding and proof tasks show that, while duplication increases with n, the marginal gain in coverage remains positive and consistent with the reported log-linear trend up to the largest n we tested (n = 1000). We will also re-fit the exponentiated power law after removing duplicates to demonstrate that the scaling relationship is not driven by repeated identical samples. revision: yes
-
Referee: [SWE-bench Lite evaluation] The SWE-bench Lite result (15.9% → 56% at 250 samples) relies entirely on automatic verification to label samples as correct. No analysis of verifier false-positive rate, inter-sample consistency, or sensitivity to verifier errors is provided; any systematic mislabeling would inflate both coverage and the reported performance gain, undermining the claim that repeated sampling outperforms the single-sample SOTA.
Authors: We recognize that the reliability of the automatic verifier is central to interpreting the SWE-bench Lite gains. In the revision we will include a manual audit of 200 randomly selected samples that the verifier labeled as passing. We will report the observed false-positive rate, describe any systematic failure modes, and provide a sensitivity analysis showing how the headline 56% figure changes under plausible error rates. We will also add inter-sample consistency statistics (e.g., the fraction of problems for which multiple independent samples receive the same verifier verdict). These additions will allow readers to assess the robustness of the performance improvement relative to the prior single-sample SOTA. revision: yes
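The sensitivity analysis promised here can be sketched in a few lines. The per-sample false-positive rate is a hypothetical knob, and verifier errors are assumed independent across samples, so this is illustrative rather than the authors' method:

```python
def corrected_solve_rate(observed: float, k: int, fpr: float) -> float:
    """Back out a true solve rate from an observed solve rate at budget k,
    assuming each incorrect sample independently fools the verifier with
    probability fpr. Illustrative only."""
    p_spurious = 1 - (1 - fpr) ** k  # an unsolved problem passes anyway
    if p_spurious >= 1:
        return 0.0
    return max(0.0, (observed - p_spurious) / (1 - p_spurious))

# How a headline 56% at k=250 would shrink under hypothetical FPRs.
for fpr in (0.0, 1e-4, 1e-3):
    print(f"fpr={fpr:.0e}  corrected={corrected_solve_rate(0.56, 250, fpr):.3f}")
```

The key effect is that even a small per-sample false-positive rate compounds over hundreds of samples, so solve rates at large k are far more sensitive to verifier error than single-sample accuracy is.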
Circularity Check
No circularity: purely empirical coverage measurements and curve fitting
full rationale
The paper's core claims rest on direct experimental measurements of coverage (fraction of problems solved by at least one sample) across increasing sample budgets on multiple tasks and models. Coverage is computed by generating independent samples and checking them against automatic verifiers where available; the log-linear relationship is then fit post-hoc with an exponentiated power law to the observed points. No derivation chain exists that reduces a claimed prediction or first-principles result back to fitted parameters or self-citations by construction. The scaling observation is reported as an empirical finding, not as a theorem or closed-form prediction derived from the same data it describes.
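The post-hoc fit described here can be reproduced with ordinary least squares. A minimal sketch assuming the exponentiated-power-law form c(k) = exp(a·k^b) with a < 0 and 0 < c < 1 (the linearization below is an assumption about the fitting procedure, not the paper's exact code):

```python
from math import exp, log

def fit_exponentiated_power_law(ks, cs):
    """Least-squares fit of c(k) = exp(a * k**b), a < 0, via the
    linearization log(-log c) = log(-a) + b * log k."""
    xs = [log(k) for k in ks]
    ys = [log(-log(c)) for c in cs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = -exp(my - b * mx)
    return a, b

# Synthetic sanity check: recover known parameters from noiseless data.
ks = [1, 4, 16, 64, 256, 1024]
cs = [exp(-2.0 * k ** -0.35) for k in ks]
a, b = fit_exponentiated_power_law(ks, cs)
print(round(a, 6), round(b, 6))  # ≈ -2.0 and -0.35
```

Because the parameters are estimated from the same coverage points they describe, the fit is descriptive, which is exactly why the circularity check passes: nothing downstream is derived from a and b.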
Axiom & Free-Parameter Ledger
free parameters (1)
- power-law exponent
axioms (1)
- domain assumption: The generated samples are sufficiently diverse and independent to allow coverage to increase with more samples.
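Under this independence assumption, coverage has a closed form that makes the role of diversity explicit; with per-problem single-sample success probabilities $p_i$ over $N$ problems:

```latex
c_i(k) = 1 - (1 - p_i)^k,
\qquad
C(k) = \frac{1}{N}\sum_{i=1}^{N}\bigl(1 - (1 - p_i)^k\bigr)
```

If the sampler instead repeats outputs from a finite effective support, $k$ is roughly capped at the number of distinct draws and $C(k)$ saturates below its i.i.d. value, which is exactly the failure mode this assumption rules out.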
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.RealityFromDistinction · reality_from_one_distinction (match: unclear): "we observe that coverage -- the fraction of problems that are solved by any generated sample -- scales with the number of samples over four orders of magnitude. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws."
- IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel (match: unclear): "When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-Coder-V2-Instruct increases from 15.9% with one sample to 56% with 250 samples"
Forward citations
Cited by 46 Pith papers
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
-
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents
VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habita...
-
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...
-
Unsupervised Process Reward Models
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
-
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
-
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
-
Regulating Branch Parallelism in LLM Serving
TAPER regulates LLM branch parallelism by admitting extra branches opportunistically when predicted externality fits slack, delivering 1.48-1.77x higher goodput than eager or fixed-cap baselines on Qwen3-32B while kee...
-
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...
-
Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization
Joint Consistency casts test-time aggregation as Ising-type energy minimization with pairwise LLM-judge interactions, subsuming voting methods and outperforming baselines across reasoning tasks.
-
When Can Voting Help, Hurt, or Change Course? Exact Structure of Binary Test-Time Aggregation
The voting curve from repeated binary predictions is exactly equivalent to a signed voting signature capturing excess latent mass above the majority threshold at binomial variance scales, via signed Hausdorff moments.
-
StoryAlign: Evaluating and Training Reward Models for Story Generation
StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.
-
Adaptive Consensus in LLM Ensembles via Sequential Evidence Accumulation: Automatic Budget Identification and Calibrated Commit Signals
DASE adaptively stops LLM ensemble deliberation on detected consensus, matching fixed-budget accuracy with one-tenth the bandwidth and providing commit signals complementary to verbalized model confidence.
-
Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference
Two calls per example identify the first two moments of latent correctness probability, enabling exact bounds on the vote-accuracy curve for any majority-vote budget under conditional i.i.d. assumptions.
-
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
-
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation
SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.
-
Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...
-
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
-
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
-
Engagement Process: Rethinking the Temporal Interface of Action and Observation
Engagement Process decouples actions and observations into separate time-based event streams within a POMDP structure to explicitly model timing mismatches, deliberation latency, and multi-rate interactions.
-
fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive kl and gaussian curriculum
FG-ExPO improves GRPO by adaptively scaling the KL penalty with batch accuracy and sampling questions via a Gaussian centered at 0.5 accuracy, delivering up to 13.34 point gains on AIME 2025 pass@32.
-
What should post-training optimize? A test-time scaling law perspective
Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
-
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
SPEX accelerates Tree-of-Thought LLM reasoning 1.2-3x via speculative path selection, dynamic budget allocation across queries, and adaptive early termination, with up to 4.1x when combined with token speculative decoding.
-
APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation
APCD reduces LLM hallucinations by expanding decoding paths adaptively when entropy signals uncertainty and by contrasting divergent paths to control their interaction.
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
-
Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning
A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
-
The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling
APPS approximates power sampling for LLM reasoning via parallel particle propagation with future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs on benchmarks.
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
-
Multimodal Diffusion to Mutually Enhance Polarized Light and Low Resolution EBSD Data
A multimodal diffusion model trained on synthetic data enhances low-resolution EBSD and corrupted polarized light data, achieving near full-resolution performance with only 25% EBSD data.
-
Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations
An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning...
-
Evaluation-driven Scaling for Scientific Discovery
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
-
Generalization in LLM Problem Solving: The Case of the Shortest Path
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
-
Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
-
When Independent Sampling Outperforms Agentic Reasoning
On Codeforces problems, independent k-shot sampling achieves better accuracy-cost and accuracy-query tradeoffs than agentic reasoning, even with prompt caching.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
expo: Exploration-prioritized policy optimization via adaptive kl regulation and gaussian curriculum sampling
EXPO improves GRPO via accuracy-conditioned KL scaling and Gaussian curriculum sampling centered at 0.5 accuracy, delivering gains up to 13.34 points on AIME 2025 pass@32 and 2.66 average on 8B models.
-
Adaptive Negative Reinforcement for LLM Reasoning:Dynamically Balancing Correction and Diversity in RLVR
Adaptive scheduling of penalties over training time plus confidence-based weighting of mistakes improves LLM performance on math reasoning benchmarks compared to fixed-penalty negative reinforcement.
Reference graph
Works this paper leans on
- [1]
-
[2]
URL https://openai.com/index/hello-gpt-4o/
Hello gpt-4o, 2024. URL https://openai.com/index/hello-gpt-4o/
work page 2024
-
[3]
URL https://llama.meta.com/llama3/
Meta llama 3, 2024. URL https://llama.meta.com/llama3/
work page 2024
-
[4]
URL https://www.anthropic.com/news/claude-3-5-sonnet
Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet
work page 2024
- [5]
-
[6]
Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, and Bing Xiang. Bifurcated attention: Accelerating massively parallel decoding with shared prefixes in llms, 2024. URL https://arxiv.org/abs/2403.08845
-
[7]
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...
work page 2022
-
[8]
In: Wooldridge, M.J., Dy, J.G., Natarajan, S
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, March 2024. ISS...
-
[9]
Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373
-
[10]
Combining deep reinforcement learning and search for imperfect-information games
Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong. Combining deep reinforcement learning and search for imperfect-information games. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546
work page 2020
-
[11]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[12]
Deep blue. Artificial Intelligence, 134(1):57–83, 2002
Murray Campbell, A. Joseph Hoane, and Feng-hsiung Hsu. Deep blue. Artif. Intell., 134(1–2):57–83, jan 2002. ISSN 0004-3702. doi: 10.1016/S0004-3702(01)00129-1. URL https://doi.org/10.1016/S0004-3702(01)00129-1
-
[13]
Alphamath almost zero: process supervision without process, 2024
Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: process supervision without process, 2024.
work page 2024
-
[14]
Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, and James Zou. Are more llm calls all you need? towards scaling laws of compound inference systems, 2024. URL https://arxiv.org/abs/2403.02419
-
[15]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[16]
On the Measure of Intelligence
François Chollet. On the measure of intelligence, 2019. URL https://arxiv.org/abs/1911.01547
work page internal anchor Pith review arXiv 2019
-
[17]
Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences, 2017. URL https://arxiv.org/abs/1706.03741
-
[18]
Training verifiers to solve math word problems, 2021
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021
work page 2021
-
[19]
Networks of networks: Complexity class principles applied to compound ai systems design, 2024
Jared Quincy Davis, Boris Hanin, Lingjiao Chen, Peter Bailis, Ion Stoica, and Matei Zaharia. Networks of networks: Complexity class principles applied to compound ai systems design, 2024. URL https://arxiv.org/abs/2407.16831
-
[20]
Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model
DeepSeek-AI et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. URL https://arxiv.org/abs/2405.04434
work page internal anchor Pith review Pith/arXiv arXiv
- [21]
-
[22]
Mostafa Dehghani, Anurag Arnab, Lucas Beyer, Ashish Vaswani, and Yi Tay. The efficiency misnomer,
- [23]
-
[24]
A framework for few-shot language model evaluation, 12 2023
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...
-
[25]
Ryan Greenblatt. Getting 50% SOTA on ARC-AGI with GPT-4o. https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o, 2024
work page 2024
-
[26]
The larger the better? improved llm code-generation via budget reallocation, 2024
Michael Hassid, Tal Remez, Jonas Gehring, Roy Schwartz, and Yossi Adi. The larger the better? improved llm code-generation via budget reallocation, 2024. URL https://arxiv.org/abs/2404.00725
-
[27]
Measuring coding challenge competence with apps, 2021
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps, 2021
work page 2021
-
[28]
Measuring mathematical problem solving with the math dataset, 2021
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021
work page 2021
-
[29]
Deep learning scaling is predictable, empirically
Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. URL https://arxiv.org/abs/1712.00409
work page internal anchor Pith review Pith/arXiv arXiv
- [30]
-
[31]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556
-
[32]
Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-star: Training verifiers for self-taught reasoners, 2024
-
[33]
Robert Irvine, Douglas Boubert, Vyas Raina, Adian Liusie, Ziyi Zhu, Vineet Mudupalli, Aliaksei Korshuk, Zongyi Liu, Fritz Cremer, Valentin Assassi, Christie-Carol Beauchamp, Xiaoding Lu, Thomas Rialan, and William Beauchamp. Rewarding chatbots for real-world engagement with millions of users, 2023. URL https://arxiv.org/abs/2303.06135
-
[34]
Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion, 2023. URL https://arxiv.org/abs/2306.02561
-
[35]
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770
- [36]
-
[37]
Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y Fu, Christopher Ré, and Azalia Mirhoseini. Hydragen: High-throughput llm inference with shared prefixes. arXiv preprint arXiv:2402.05099, 2024
-
[38]
Jikun Kang, Xin Zhe Li, Xi Chen, Amirreza Kazemi, Qianyi Sun, Boxing Chen, Dong Li, Xu He, Quan He, Feng Wen, Jianye Hao, and Jun Yao. Mindstar: Enhancing math reasoning in pre-trained llms at inference time, 2024. URL https://arxiv.org/abs/2405.16265
-
[39]
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020
-
[40]
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361
-
[41]
Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy Liang. Spoc: Search-based pseudocode to code, 2019. URL https://arxiv.org/abs/1906.04908
-
[42]
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. Rewardbench: Evaluating reward models for language modeling, 2024. URL https://arxiv.org/abs/2403.13787
-
[43]
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858
-
[44]
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022
-
[45]
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023
-
[46]
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023. URL https://arxiv.org/abs/2303.17651
-
[47]
Alex Nguyen, Dheeraj Mekala, Chengyu Dong, and Jingbo Shang. When is the consistent prediction likely to be a correct prediction?, 2024. URL https://arxiv.org/abs/2407.05778
-
[48]
Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data, 2024. URL https://arxiv.org/abs/2406.18665
-
[49]
OpenAI et al. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774
-
[50]
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019
-
[51]
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, et al. Code llama: Open foundation models for code, 2023
-
[52]
Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, and Pang Wei Koh. Scaling retrieval-based language models with a trillion-token datastore, 2024. URL https://arxiv.org/abs/2407.12854
-
[53]
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017
-
[54]
Yifan Song, Guoyin Wang, Sujian Li, and Bill Yuchen Lin. The good, the bad, and the greedy: Evaluation of llms should not ignore non-determinism, 2024. URL https://arxiv.org/abs/2407.10457
-
[55]
Gemma Team et al. Gemma: Open models based on gemini research and technology, 2024. URL https://arxiv.org/abs/2403.08295
-
[56]
Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing, 2024. URL https://arxiv.org/abs/2404.12253
-
[57]
Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024. ISSN 1476-4687. doi: 10.1038/ s41586-023-06747-5. URL https://doi.org/10.1038/s41586-023-06747-5
-
[58]
Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020
-
[59]
Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models, 2024. URL https://arxiv.org/abs/2401.10491
-
[60]
Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts, 2024. URL https://arxiv.org/abs/2406.12845
-
[61]
Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities, 2024. URL https://arxiv.org/abs/2406.04692
- [62]
-
[63]
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023
-
[64]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023
-
[65]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2022. URL https://arxiv.org/abs/2210.03629
-
[66]
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601
- [67]
-
[68]
Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110, 2021
-
[69]
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs, 2024. URL https://arxiv.org/abs/2312.07104
-
[70]
Albert Örwall. Moatless tools. https://github.com/aorwall/moatless-tools/tree/a1017b78e3e69e7d205b1a3faa83a7d19fce3fa6, 2024

A Sampling Experimental Setup

A.1 Lean Formal Proofs

We report results on the 130 questions in the test set of the lean4 MiniF2F dataset that correspond to formalized MATH problems. This dataset is derived from the fixed versi...
-
[71]
Header imports present in each problem in the HuggingFace cat-searcher/minif2f-lean4 dataset, an upload of the lean4 MiniF2F dataset
-
[72]
The theorem definition. To avoid leaking information about how to solve the theorem through its name, we replace the name of the theorem with theorem_i, where i ∈ {1, 2, 3, 4, 5} for the few-shot examples and i = 6 for the current problem. We set 200 as the max token length for the generated solution. To grade solutions, we use the lean-dojo 1.1.2 library...
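The renaming step above can be sketched as follows. The paper does not give its implementation, so the function name and the regex are illustrative assumptions; the sketch only shows the idea of anonymizing a Lean 4 theorem name before building the prompt:

```python
import re

def anonymize_theorem(statement: str, index: int) -> str:
    """Replace the original theorem name with theorem_<index> so the
    name cannot leak hints about the proof. Assumes the statement is a
    single Lean 4 declaration beginning with `theorem <name> ...`."""
    return re.sub(r"^theorem\s+\S+", f"theorem theorem_{index}", statement, count=1)

# Few-shot examples use indices 1-5; the current problem uses index 6.
print(anonymize_theorem("theorem mathd_algebra_478 (x : ℝ) : x + 0 = x := by simp", 6))
# → theorem theorem_6 (x : ℝ) : x + 0 = x := by simp
```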