Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-12 04:37 UTC · model grok-4.3
The pith
Repeatedly sampling from language models scales the fraction of problems solved over four orders of magnitude.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Coverage, the fraction of problems solved by at least one generated sample, scales with the number of samples drawn from the model over four orders of magnitude, across tasks and models. The relationship is often log-linear and well modeled by an exponentiated power law. In domains equipped with automatic verification, the increase in coverage translates directly into performance: repeated sampling lifts accuracy on SWE-bench Lite from 15.9 percent with one sample to 56 percent with 250 samples from a single model, above the 43 percent single-sample state of the art.
What carries the argument
Coverage: the fraction of problems for which at least one of the repeatedly sampled model outputs is correct.
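This metric is typically estimated from a finite sample pool with the standard unbiased pass@k estimator (Chen et al., 2021); a minimal sketch, with illustrative function names:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples
    is correct, given c correct answers among n generated samples."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def coverage(per_problem: list[tuple[int, int]], k: int) -> float:
    """Coverage at budget k: mean pass@k over per-problem (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)

# One problem solved only by sampling (1/10 correct) and one unsolved problem.
print(coverage([(10, 1), (10, 0)], k=5))  # → 0.25
```

The estimator averages over all size-k subsets of the n samples rather than drawing k fresh samples, which is what makes it unbiased at any k ≤ n.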
If this is right
- In coding and proof tasks, performance rises in step with the number of samples when automatic verification is available.
- Inference compute can be traded for higher success rates without changing the underlying model.
- Majority voting and reward-model selection reach a plateau after a few hundred samples and do not keep scaling.
- The log-linear pattern suggests inference-time scaling laws may exist alongside training-time scaling laws.
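The contrast between verified selection and majority voting can be illustrated with a toy closed-form model. The per-problem success probabilities below are hypothetical, and majority voting is idealized as "correct iff more than half of the samples are correct":

```python
from math import comb

# Hypothetical single-sample success probabilities; the low-probability
# problems are the ones repeated sampling eventually cracks.
PROBS = [0.6, 0.3, 0.05, 0.01, 0.002]

def coverage(k: int) -> float:
    """P(at least one of k independent samples is correct), averaged."""
    return sum(1 - (1 - p) ** k for p in PROBS) / len(PROBS)

def majority(k: int) -> float:
    """P(more than half of k samples are correct), averaged (idealized)."""
    def tail(p: float) -> float:
        return sum(comb(k, i) * p**i * (1 - p) ** (k - i)
                   for i in range(k // 2 + 1, k + 1))
    return sum(tail(p) for p in PROBS) / len(PROBS)

for k in (1, 10, 100, 1000):
    print(f"k={k:5d}  coverage={coverage(k):.3f}  majority={majority(k):.3f}")
```

In this toy model coverage keeps climbing toward 1 as k grows, while majority voting plateaus near the fraction of problems whose single-sample success probability exceeds one half (here 1/5), mirroring the plateau the paper reports.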
Where Pith is reading between the lines
- If the same scaling holds at much larger sample budgets, difficult problems could be solved by allocating more inference compute in the same way larger models are trained.
- Tasks lacking verifiers would benefit from new selection methods that continue to improve beyond the plateau observed with voting.
- The pattern implies many failures are sampling variance rather than absolute model limits, opening a route to diagnose capability gaps by exhaustive sampling.
Load-bearing premise
That an automatic verifier can identify the correct samples in the collection, and that the model does not collapse into repetitive outputs that erase the sample diversity coverage depends on.
What would settle it
Measuring whether coverage continues to rise or flattens after generating several thousand samples per problem on a fixed set of tasks.
original abstract
Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model. Across multiple tasks and models, we observe that coverage -- the fraction of problems that are solved by any generated sample -- scales with the number of samples over four orders of magnitude. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. In domains like coding and formal proofs, where answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-Coder-V2-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-sample state-of-the-art of 43%. In domains without automatic verifiers, we find that common methods for picking from a sample collection (majority voting and reward models) plateau beyond several hundred samples and fail to fully scale with the sample budget.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper explores scaling inference compute for LLMs via repeated sampling of candidate solutions rather than single-shot generation. It empirically demonstrates that coverage—the fraction of problems solved by at least one sample—scales with sample count over four orders of magnitude, often following a log-linear trend that can be modeled by an exponentiated power law. In domains with automatic verifiers (coding, formal proofs), this directly improves performance; e.g., on SWE-bench Lite, DeepSeek-Coder-V2-Instruct rises from 15.9% (1 sample) to 56% (250 samples), exceeding the prior single-sample SOTA of 43%. In non-verifiable domains, majority voting and reward models plateau after a few hundred samples.
Significance. If the scaling relationship and its translation to performance hold under scrutiny, the work provides concrete evidence for inference-time scaling laws, analogous to training compute scaling. This could shift practice toward allocating more inference budget to sampling in verifiable settings, with immediate gains on benchmarks like SWE-bench. The empirical breadth across tasks and models, plus the outperformance result, makes the finding potentially impactful for both theory and deployment.
major comments (2)
- [Experiments (coverage plots and power-law fits)] The headline coverage scaling claim (log-linear over four orders of magnitude, fit by exponentiated power law) is load-bearing for the inference-time scaling law conclusion, yet the manuscript reports no metrics on sample uniqueness, entropy, or duplicate rates at large n (e.g., beyond a few hundred). If the model distribution has finite support and begins repeating solutions, the measured coverage curve would saturate and the power-law fit would not reflect genuine scaling; this directly affects the weakest assumption noted in the stress test.
- [SWE-bench Lite evaluation] The SWE-bench Lite result (15.9% → 56% at 250 samples) relies entirely on automatic verification to label samples as correct. No analysis of verifier false-positive rate, inter-sample consistency, or sensitivity to verifier errors is provided; any systematic mislabeling would inflate both coverage and the reported performance gain, undermining the claim that repeated sampling outperforms the single-sample SOTA.
minor comments (2)
- [Modeling section] The power-law fitting procedure (how the exponent is estimated, whether fits are per-task or aggregated, and goodness-of-fit statistics) is described only at a high level; adding the exact regression details and per-task R² values would improve reproducibility.
- [Figures 1–3] Coverage curves in the main figures would benefit from error bands (e.g., across random seeds or problem subsets) to indicate variability, especially at the largest sample counts where repetition risk is highest.
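The diversity metrics requested in the first major comment are cheap to compute. A minimal sketch over raw output strings (in practice one would canonicalize solutions, e.g. strip whitespace or normalize ASTs, before counting):

```python
from collections import Counter
from math import log

def diversity_stats(samples: list[str]) -> dict[str, float]:
    """Duplicate rate and Shannon entropy (nats) of the empirical
    distribution over sampled outputs for a single problem."""
    counts = Counter(samples)
    n = len(samples)
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return {
        "unique_frac": len(counts) / n,  # fraction of distinct outputs
        "dup_rate": 1 - len(counts) / n,
        "entropy": entropy,
    }

# A saturating sampler keeps emitting the same string: entropy tends to 0.
print(diversity_stats(["x + 1", "x + 1", "x + 1", "x - 1"]))
```

Tracking `unique_frac` as a function of n is the direct check on whether apparent coverage gains come from genuinely new samples.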
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. The two major comments raise important points about the robustness of our empirical claims. We address each below and indicate where we will revise the manuscript to incorporate additional analysis.
point-by-point responses
-
Referee: [Experiments (coverage plots and power-law fits)] The headline coverage scaling claim (log-linear over four orders of magnitude, fit by exponentiated power law) is load-bearing for the inference-time scaling law conclusion, yet the manuscript reports no metrics on sample uniqueness, entropy, or duplicate rates at large n (e.g., beyond a few hundred). If the model distribution has finite support and begins repeating solutions, the measured coverage curve would saturate and the power-law fit would not reflect genuine scaling; this directly affects the weakest assumption noted in the stress test.
Authors: We agree that quantifying sample diversity is necessary to confirm that the observed coverage scaling is not an artifact of repetition. In the revised manuscript we will add a new subsection reporting (i) the fraction of unique solutions as a function of n, (ii) the entropy of the empirical distribution over solutions, and (iii) the rate at which new unique solutions appear beyond n = 100. Our internal checks on the coding and proof tasks show that, while duplication increases with n, the marginal gain in coverage remains positive and consistent with the reported log-linear trend up to the largest n we tested (n = 1000). We will also re-fit the exponentiated power law after removing duplicates to demonstrate that the scaling relationship is not driven by repeated identical samples. revision: yes
-
Referee: [SWE-bench Lite evaluation] The SWE-bench Lite result (15.9% → 56% at 250 samples) relies entirely on automatic verification to label samples as correct. No analysis of verifier false-positive rate, inter-sample consistency, or sensitivity to verifier errors is provided; any systematic mislabeling would inflate both coverage and the reported performance gain, undermining the claim that repeated sampling outperforms the single-sample SOTA.
Authors: We recognize that the reliability of the automatic verifier is central to interpreting the SWE-bench Lite gains. In the revision we will include a manual audit of 200 randomly selected samples that the verifier labeled as passing. We will report the observed false-positive rate, describe any systematic failure modes, and provide a sensitivity analysis showing how the headline 56% figure changes under plausible error rates. We will also add inter-sample consistency statistics (e.g., the fraction of problems for which multiple independent samples receive the same verifier verdict). These additions will allow readers to assess the robustness of the performance improvement relative to the prior single-sample SOTA. revision: yes
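The sensitivity analysis promised here can be sketched in a few lines. The per-sample false-positive rate is a hypothetical knob, and verifier errors are assumed independent across samples, so this is illustrative rather than the authors' method:

```python
def corrected_solve_rate(observed: float, k: int, fpr: float) -> float:
    """Back out a true solve rate from an observed solve rate at budget k,
    assuming each incorrect sample independently fools the verifier with
    probability fpr. Illustrative only."""
    p_spurious = 1 - (1 - fpr) ** k  # an unsolved problem passes anyway
    if p_spurious >= 1:
        return 0.0
    return max(0.0, (observed - p_spurious) / (1 - p_spurious))

# How a headline 56% at k=250 would shrink under hypothetical FPRs.
for fpr in (0.0, 1e-4, 1e-3):
    print(f"fpr={fpr:.0e}  corrected={corrected_solve_rate(0.56, 250, fpr):.3f}")
```

The key effect is that even a small per-sample false-positive rate compounds over hundreds of samples, so solve rates at large k are far more sensitive to verifier error than single-sample accuracy is.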
Circularity Check
No circularity: purely empirical coverage measurements and curve fitting
full rationale
The paper's core claims rest on direct experimental measurements of coverage (fraction of problems solved by at least one sample) across increasing sample budgets on multiple tasks and models. Coverage is computed by generating independent samples and checking them against automatic verifiers where available; the log-linear relationship is then fit post-hoc with an exponentiated power law to the observed points. No derivation chain exists that reduces a claimed prediction or first-principles result back to fitted parameters or self-citations by construction. The scaling observation is reported as an empirical finding, not as a theorem or closed-form prediction derived from the same data it describes.
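The post-hoc fit described here can be reproduced with ordinary least squares. A minimal sketch assuming the exponentiated-power-law form c(k) = exp(a·k^b) with a < 0 and 0 < c < 1 (the linearization below is an assumption about the fitting procedure, not the paper's exact code):

```python
from math import exp, log

def fit_exponentiated_power_law(ks, cs):
    """Least-squares fit of c(k) = exp(a * k**b), a < 0, via the
    linearization log(-log c) = log(-a) + b * log k."""
    xs = [log(k) for k in ks]
    ys = [log(-log(c)) for c in cs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = -exp(my - b * mx)
    return a, b

# Synthetic sanity check: recover known parameters from noiseless data.
ks = [1, 4, 16, 64, 256, 1024]
cs = [exp(-2.0 * k ** -0.35) for k in ks]
a, b = fit_exponentiated_power_law(ks, cs)
print(round(a, 6), round(b, 6))  # ≈ -2.0 and -0.35
```

Because the parameters are estimated from the same coverage points they describe, the fit is descriptive, which is exactly why the circularity check passes: nothing downstream is derived from a and b.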
Axiom & Free-Parameter Ledger
free parameters (1)
- power-law exponent
axioms (1)
- domain assumption: The generated samples are sufficiently diverse and independent to allow coverage to increase with more samples.
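Under this independence assumption, coverage has a closed form that makes the role of diversity explicit; with per-problem single-sample success probabilities $p_i$ over $N$ problems:

```latex
c_i(k) = 1 - (1 - p_i)^k,
\qquad
C(k) = \frac{1}{N}\sum_{i=1}^{N}\bigl(1 - (1 - p_i)^k\bigr)
```

If the sampler instead repeats outputs from a finite effective support, $k$ is roughly capped at the number of distinct draws and $C(k)$ saturates below its i.i.d. value, which is exactly the failure mode this assumption rules out.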
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.RealityFromDistinction · reality_from_one_distinction (match: unclear): "we observe that coverage -- the fraction of problems that are solved by any generated sample -- scales with the number of samples over four orders of magnitude. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws."
- IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel (match: unclear): "When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-Coder-V2-Instruct increases from 15.9% with one sample to 56% with 250 samples"
Forward citations
Cited by 46 Pith papers
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
-
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents
VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habita...
-
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...
-
Unsupervised Process Reward Models
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
-
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
-
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
-
Regulating Branch Parallelism in LLM Serving
TAPER regulates LLM branch parallelism by admitting extra branches opportunistically when predicted externality fits slack, delivering 1.48-1.77x higher goodput than eager or fixed-cap baselines on Qwen3-32B while kee...
-
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...
-
Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization
Joint Consistency casts test-time aggregation as Ising-type energy minimization with pairwise LLM-judge interactions, subsuming voting methods and outperforming baselines across reasoning tasks.
-
When Can Voting Help, Hurt, or Change Course? Exact Structure of Binary Test-Time Aggregation
The voting curve from repeated binary predictions is exactly equivalent to a signed voting signature capturing excess latent mass above the majority threshold at binomial variance scales, via signed Hausdorff moments.
-
StoryAlign: Evaluating and Training Reward Models for Story Generation
StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.
-
Adaptive Consensus in LLM Ensembles via Sequential Evidence Accumulation: Automatic Budget Identification and Calibrated Commit Signals
DASE adaptively stops LLM ensemble deliberation on detected consensus, matching fixed-budget accuracy with one-tenth the bandwidth and providing commit signals complementary to verbalized model confidence.
-
Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference
Two calls per example identify the first two moments of latent correctness probability, enabling exact bounds on the vote-accuracy curve for any majority-vote budget under conditional i.i.d. assumptions.
-
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
-
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation
SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.
-
Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...
-
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
-
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
-
Engagement Process: Rethinking the Temporal Interface of Action and Observation
Engagement Process decouples actions and observations into separate time-based event streams within a POMDP structure to explicitly model timing mismatches, deliberation latency, and multi-rate interactions.
-
fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive kl and gaussian curriculum
FG-ExPO improves GRPO by adaptively scaling the KL penalty with batch accuracy and sampling questions via a Gaussian centered at 0.5 accuracy, delivering up to 13.34 point gains on AIME 2025 pass@32.
-
What should post-training optimize? A test-time scaling law perspective
Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
-
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
SPEX accelerates Tree-of-Thought LLM reasoning 1.2-3x via speculative path selection, dynamic budget allocation across queries, and adaptive early termination, with up to 4.1x when combined with token speculative decoding.
-
APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation
APCD reduces LLM hallucinations by expanding decoding paths adaptively when entropy signals uncertainty and by contrasting divergent paths to control their interaction.
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
-
Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning
A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
-
The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling
APPS approximates power sampling for LLM reasoning via parallel particle propagation with future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs on benchmarks.
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
-
Multimodal Diffusion to Mutually Enhance Polarized Light and Low Resolution EBSD Data
A multimodal diffusion model trained on synthetic data enhances low-resolution EBSD and corrupted polarized light data, achieving near full-resolution performance with only 25% EBSD data.
-
Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations
An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning...
-
Evaluation-driven Scaling for Scientific Discovery
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
-
Generalization in LLM Problem Solving: The Case of the Shortest Path
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
-
Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
-
When Independent Sampling Outperforms Agentic Reasoning
On Codeforces problems, independent k-shot sampling achieves better accuracy-cost and accuracy-query tradeoffs than agentic reasoning, even with prompt caching.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
expo: Exploration-prioritized policy optimization via adaptive kl regulation and gaussian curriculum sampling
EXPO improves GRPO via accuracy-conditioned KL scaling and Gaussian curriculum sampling centered at 0.5 accuracy, delivering gains up to 13.34 points on AIME 2025 pass@32 and 2.66 average on 8B models.
-
Adaptive Negative Reinforcement for LLM Reasoning:Dynamically Balancing Correction and Diversity in RLVR
Adaptive scheduling of penalties over training time plus confidence-based weighting of mistakes improves LLM performance on math reasoning benchmarks compared to fixed-penalty negative reinforcement.
Reference graph
Works this paper leans on
- [1]
-
[2]
URL https://openai.com/index/hello-gpt-4o/
Hello gpt-4o, 2024. URL https://openai.com/index/hello-gpt-4o/
work page 2024
-
[3]
URL https://llama.meta.com/llama3/
Meta llama 3, 2024. URL https://llama.meta.com/llama3/
work page 2024
-
[4]
URL https://www.anthropic.com/news/claude-3-5-sonnet
Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet
work page 2024
- [5]
-
[6]
Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, and Bing Xiang. Bifurcated attention: Accelerating massively parallel decoding with shared prefixes in llms, 2024. URL https://arxiv.org/abs/2403.08845
-
[7]
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...
work page 2022
-
[8]
In: Wooldridge, M.J., Dy, J.G., Natarajan, S
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, March 2024. ISS...
-
[9]
Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373
-
[10]
Combining deep reinforcement learning and search for imperfect-information games
Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong. Combining deep reinforcement learning and search for imperfect-information games. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546
work page 2020
-
[11]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[12]
Deep blue. Artificial Intelligence, 134(1):57–83, 2002
Murray Campbell, A. Joseph Hoane, and Feng-hsiung Hsu. Deep blue. Artif. Intell., 134(1–2):57–83, jan 2002. ISSN 0004-3702. doi: 10.1016/S0004-3702(01)00129-1. URL https://doi.org/10.1016/S0004-3702(01)00129-1
-
[13]
Alphamath almost zero: process supervision without process, 2024
Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: process supervision without process, 2024.
work page 2024
-
[14]
Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, and James Zou. Are more llm calls all you need? towards scaling laws of compound inference systems, 2024. URL https://arxiv.org/abs/2403.02419
-
[15]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[16]
On the Measure of Intelligence
François Chollet. On the measure of intelligence, 2019. URL https://arxiv.org/abs/1911.01547
work page internal anchor Pith review arXiv 2019
-
[17]
Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences, 2017. URL https://arxiv.org/abs/1706.03741
-
[18]
Training verifiers to solve math word problems, 2021
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021
work page 2021
-
[19]
Networks of networks: Complexity class principles applied to compound ai systems design, 2024
Jared Quincy Davis, Boris Hanin, Lingjiao Chen, Peter Bailis, Ion Stoica, and Matei Zaharia. Networks of networks: Complexity class principles applied to compound ai systems design, 2024. URL https://arxiv.org/abs/2407.16831
-
[20]
Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model
DeepSeek-AI et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. URL https://arxiv.org/abs/2405.04434
work page internal anchor Pith review Pith/arXiv arXiv
- [21]
-
[22]
Mostafa Dehghani, Anurag Arnab, Lucas Beyer, Ashish Vaswani, and Yi Tay. The efficiency misnomer,
- [23]
-
[24]
A framework for few-shot language model evaluation, 12 2023
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...
-
[25]
Ryan Greenblatt. Getting 50% SOTA on ARC-AGI with GPT-4o. https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o, 2024
work page 2024
-
[26]
The larger the better? improved llm code-generation via budget reallocation, 2024
Michael Hassid, Tal Remez, Jonas Gehring, Roy Schwartz, and Yossi Adi. The larger the better? improved llm code-generation via budget reallocation, 2024. URL https://arxiv.org/abs/2404.00725
-
[27]
Measuring coding challenge competence with apps, 2021
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps, 2021
work page 2021
-
[28]
Measuring mathematical problem solving with the math dataset, 2021
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021
work page 2021
-
[29]
Deep learning scaling is predictable, empirically
Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. URL https://arxiv.org/abs/1712.00409
work page internal anchor Pith review Pith/arXiv arXiv
- [30]
-
[31]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556
-
[32]
Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-star: Training verifiers for self-taught reasoners, 2024
-
[33]
Robert Irvine, Douglas Boubert, Vyas Raina, Adian Liusie, Ziyi Zhu, Vineet Mudupalli, Aliaksei Korshuk, Zongyi Liu, Fritz Cremer, Valentin Assassi, Christie-Carol Beauchamp, Xiaoding Lu, Thomas Rialan, and William Beauchamp. Rewarding chatbots for real-world engagement with millions of users, 2023. URL https://arxiv.org/abs/2303.06135
-
[34]
Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion, 2023. URL https://arxiv.org/abs/2306.02561
-
[35]
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770
- [36]
-
[37]
Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y Fu, Christopher Ré, and Azalia Mirhoseini. Hydragen: High-throughput llm inference with shared prefixes. arXiv preprint arXiv:2402.05099, 2024
-
[38]
Jikun Kang, Xin Zhe Li, Xi Chen, Amirreza Kazemi, Qianyi Sun, Boxing Chen, Dong Li, Xu He, Quan He, Feng Wen, Jianye Hao, and Jun Yao. Mindstar: Enhancing math reasoning in pre-trained llms at inference time, 2024. URL https://arxiv.org/abs/2405.16265
-
[39]
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020
-
[40]
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361
-
[41]
Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy Liang. Spoc: Search-based pseudocode to code, 2019. URL https://arxiv.org/abs/1906.04908
-
[42]
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. Rewardbench: Evaluating reward models for language modeling, 2024. URL https://arxiv.org/abs/2403.13787
-
[43]
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858
-
[44]
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022
-
[45]
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023
-
[46]
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023. URL https://arxiv.org/abs/2303.17651
-
[47]
Alex Nguyen, Dheeraj Mekala, Chengyu Dong, and Jingbo Shang. When is the consistent prediction likely to be a correct prediction?, 2024. URL https://arxiv.org/abs/2407.05778
-
[48]
Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data, 2024. URL https://arxiv.org/abs/2406.18665
-
[49]
OpenAI et al. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774
-
[50]
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019
-
[51]
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, et al. Code llama: Open foundation models for code, 2023
-
[52]
Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, and Pang Wei Koh. Scaling retrieval-based language models with a trillion-token datastore, 2024. URL https://arxiv.org/abs/2407.12854
-
[53]
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017
-
[54]
Yifan Song, Guoyin Wang, Sujian Li, and Bill Yuchen Lin. The good, the bad, and the greedy: Evaluation of llms should not ignore non-determinism, 2024. URL https://arxiv.org/abs/2407.10457
-
[55]
Gemma Team et al. Gemma: Open models based on gemini research and technology, 2024. URL https://arxiv.org/abs/2403.08295
-
[56]
Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing, 2024. URL https://arxiv.org/abs/2404.12253
-
[57]
Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024. ISSN 1476-4687. doi: 10.1038/ s41586-023-06747-5. URL https://doi.org/10.1038/s41586-023-06747-5
-
[58]
Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020
-
[59]
Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models, 2024. URL https://arxiv.org/abs/2401.10491
-
[60]
Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts, 2024. URL https://arxiv.org/abs/2406.12845
-
[61]
Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities, 2024. URL https://arxiv.org/abs/2406.04692
- [62]
-
[63]
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023
-
[64]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023
-
[65]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2022. URL https://arxiv.org/abs/2210.03629
-
[66]
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601
- [67]
-
[68]
Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110, 2021
-
[69]
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs, 2024. URL https://arxiv.org/abs/2312.07104
-
[70]
Albert Örwall. Moatless tools. https://github.com/aorwall/moatless-tools/tree/a1017b78e3e69e7d205b1a3faa83a7d19fce3fa6, 2024

A Sampling Experimental Setup

A.1 Lean Formal Proofs

We report results on the 130 questions in the test set of the lean4 MiniF2F dataset that correspond to formalized MATH problems. This dataset is derived from the fixed versi...
-
[71]
Header imports present in each problem in the HuggingFace cat-searcher/minif2f-lean4 dataset, an upload of the lean4 MiniF2F dataset
-
[72]
The theorem definition. To avoid leaking information about how to solve the theorem through its name, we replace the name of the theorem with theorem_i, where i ∈ {1, 2, 3, 4, 5} for the few-shot examples and i = 6 for the current problem. We set 200 as the max token length for the generated solution. To grade solutions, we use the lean-dojo 1.1.2 library...
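The renaming step above can be sketched as follows. The paper does not give its implementation, so the function name and the regex are illustrative assumptions; the sketch only shows the idea of anonymizing a Lean 4 theorem name before building the prompt:

```python
import re

def anonymize_theorem(statement: str, index: int) -> str:
    """Replace the original theorem name with theorem_<index> so the
    name cannot leak hints about the proof. Assumes the statement is a
    single Lean 4 declaration beginning with `theorem <name> ...`."""
    return re.sub(r"^theorem\s+\S+", f"theorem theorem_{index}", statement, count=1)

# Few-shot examples use indices 1-5; the current problem uses index 6.
print(anonymize_theorem("theorem mathd_algebra_478 (x : ℝ) : x + 0 = x := by simp", 6))
# → theorem theorem_6 (x : ℝ) : x + 0 = x := by simp
```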