pith. machine review for the scientific record.

arxiv: 2103.03874 · v2 · submitted 2021-03-05 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:55 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords MATH dataset · mathematical problem solving · transformer models · scaling laws · competition mathematics · machine learning benchmarks

The pith

The MATH dataset shows that scaling up Transformer models is insufficient for strong mathematical reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the MATH dataset to measure how well AI models can solve challenging competition-level math problems. It provides 12,500 problems, each with a full step-by-step solution, plus an auxiliary pretraining dataset to help models learn mathematical fundamentals. Testing reveals that even very large models achieve only low accuracy, and the observed trends suggest that simply making models bigger or using more compute will not yield high performance. The work argues that new algorithmic advances are needed beyond current scaling approaches.

Core claim

We introduce MATH, a dataset of 12,500 challenging competition mathematics problems with full step-by-step solutions. Despite increasing accuracy with larger models and pretraining, accuracy remains relatively low even with enormous Transformers, and scaling trends indicate it will be impractical to achieve strong mathematical reasoning without new algorithmic changes.

What carries the argument

The MATH dataset: 12,500 competition mathematics problems, each paired with a detailed step-by-step solution, used to evaluate model performance on mathematical problem solving.
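A minimal sketch of how final-answer accuracy on such a benchmark can be scored. The record fields (`answer`, `prediction`) and the naive normalization are illustrative assumptions, not the MATH release's actual schema or grader:

```python
# Hedged sketch: exact-match final-answer scoring over toy records.
# Field names ("answer", "prediction") are illustrative, not the MATH
# release's actual schema.

def normalize(ans: str) -> str:
    """Light normalization: strip whitespace and surrounding $ signs."""
    return ans.strip().strip("$").replace(" ", "")

def accuracy(records: list) -> float:
    """Fraction of records whose prediction exactly matches the gold answer."""
    hits = sum(normalize(r["prediction"]) == normalize(r["answer"]) for r in records)
    return hits / len(records)

records = [
    {"answer": "13/3", "prediction": "13/3"},  # exact match
    {"answer": "7", "prediction": "$7$"},      # match after normalization
    {"answer": "1/2", "prediction": "0.5"},    # equivalent, but counted wrong
]
print(accuracy(records))  # 0.6666666666666666
```

Note the third record: without symbolic equivalence checking, "0.5" never matches "1/2", which is one way exact-match scoring can understate model ability.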

If this is right

  • Current scaling of model size and compute will not suffice to solve advanced math problems effectively.
  • New algorithmic innovations from the research community will be necessary for progress in mathematical reasoning.
  • Models trained on the auxiliary pretraining dataset can improve but still fall short on MATH.
  • Step-by-step solutions in the dataset can be used to train models to generate explanations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Progress on MATH may require techniques that go beyond pattern matching in large datasets, such as symbolic reasoning or verification methods.
  • If scaling continues to underperform on MATH, it could indicate limitations in how Transformers process mathematical structures compared to other tasks.
  • Future benchmarks might need to incorporate more diverse or harder problems to track true advances in reasoning.

Load-bearing premise

That the MATH problems are a faithful and comprehensive measure of general mathematical problem-solving ability and that the observed performance trends with model scale will continue without new algorithmic changes.

What would settle it

Demonstrating a Transformer-based model that achieves high accuracy on the MATH dataset through scaling alone, without novel algorithms, would falsify the claim that scaling is impractical for strong mathematical reasoning.

read the original abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the MATH dataset of 12,500 competition-level mathematics problems, each with a full step-by-step solution, together with a large auxiliary pretraining corpus of mathematical text. It evaluates a range of Transformer models on MATH, reports that final-answer accuracy remains low even for the largest models tested, and concludes that continued scaling of model size and compute will be insufficient to reach strong mathematical reasoning performance if current trends persist, thereby calling for new algorithmic advances.

Significance. If the empirical measurements hold, the work supplies a demanding, well-documented benchmark that exposes clear limitations of pure scaling for mathematical reasoning, a domain where progress has lagged behind other text tasks. The public release of both MATH and the auxiliary pretraining data constitutes a concrete, reusable resource that can accelerate follow-on research; the scaling observations, while subject to the extrapolation concern below, provide a useful baseline for future comparisons.

major comments (1)
  1. [Abstract and scaling-results section] The central claim that 'simply increasing budgets and model parameter counts will be impractical … if scaling trends continue' depends on extrapolating the observed accuracy-versus-size relationship beyond the tested range. The manuscript does not specify the functional form fitted to the data, does not report confidence intervals or cross-validation of that form, and does not examine whether a change in exponent or the onset of saturation would alter the impracticality conclusion while leaving the raw accuracy numbers unchanged.
minor comments (2)
  1. [Evaluation setup] The evaluation protocol should explicitly state whether models are assessed only on final-answer correctness or also on the correctness of the generated step-by-step derivations; the current description leaves this ambiguous.
  2. [Results figures] Table or figure captions for the scaling plots should include the exact model sizes, training budgets, and number of runs used to generate each point.
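On the first minor comment: MATH solutions mark the final answer with \boxed{...}, so final-answer-only grading reduces to extracting and comparing that span. A simplified regex-based extractor, an illustration rather than the paper's released grader:

```python
import re

def extract_boxed(solution: str):
    """Return the contents of the last \\boxed{...} in a solution string,
    handling one level of nested braces; a simplified stand-in for a
    full LaTeX-aware extractor."""
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", solution)
    return matches[-1] if matches else None

sol = r"The expected value is $\boxed{\frac{13}{3}}$."
print(extract_boxed(sol))  # \frac{13}{3}
```

Grading the generated step-by-step derivation itself, as the comment asks about, would require a separate protocol (e.g. human or model judging), which this extractor does not address.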

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive comments and recommendation. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract and scaling-results section] The central claim that 'simply increasing budgets and model parameter counts will be impractical … if scaling trends continue' depends on extrapolating the observed accuracy-versus-size relationship beyond the tested range. The manuscript does not specify the functional form fitted to the data, does not report confidence intervals or cross-validation of that form, and does not examine whether a change in exponent or the onset of saturation would alter the impracticality conclusion while leaving the raw accuracy numbers unchanged.

    Authors: We agree that the extrapolation underlying the claim would be strengthened by greater statistical rigor. The original manuscript presents the scaling results via a figure of accuracy versus model size (parameter count) for a range of Transformer models and notes the slow observed trend, but does not explicitly state a functional form, report fit statistics, or conduct sensitivity checks. In the revision we will add the following: (1) we model the relationship as a power law via ordinary least-squares linear regression on log-log axes and report the fitted exponent, intercept, and R²; (2) we supply bootstrap confidence intervals on the fitted parameters and on the extrapolated accuracies at larger scales; (3) we include a sensitivity analysis that varies the exponent by ±25% around the fitted value and considers an earlier onset of saturation. Even under the most optimistic of these variants, the model sizes required to reach, for example, 50% accuracy remain on the order of 10¹²–10¹³ parameters, well beyond practical limits. These additions will be placed in the scaling-results section and referenced from the abstract; the raw accuracy numbers and the qualitative conclusion that scaling alone is insufficient are unchanged. revision: yes
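The revision plan above (log-log OLS fit, bootstrap intervals, extrapolation to a target accuracy) can be sketched as follows. The accuracy points are synthetic and illustrative; none of these numbers are the paper's measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (parameter count, accuracy %) points -- NOT the paper's data.
params = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
acc = np.array([3.0, 3.9, 5.0, 6.4, 8.2])

# Power law acc = c * params**alpha, i.e. a line on log-log axes,
# fitted by ordinary least squares as the rebuttal describes.
logp, loga = np.log10(params), np.log10(acc)
alpha, logc = np.polyfit(logp, loga, 1)

def extrapolate_params(target_acc, alpha, logc):
    """Parameter count at which the fitted power law reaches target_acc."""
    return 10 ** ((np.log10(target_acc) - logc) / alpha)

# Bootstrap confidence interval on the scale required for 50% accuracy.
needs = []
for _ in range(1000):
    idx = rng.integers(0, len(params), len(params))
    if np.unique(logp[idx]).size < 2:
        continue  # skip degenerate resamples with a single distinct size
    a, c = np.polyfit(logp[idx], loga[idx], 1)
    needs.append(extrapolate_params(50.0, a, c))

print(f"fitted exponent alpha = {alpha:.3f}")
print(f"params needed for 50% accuracy ~ {extrapolate_params(50.0, alpha, logc):.2e}")
print(f"bootstrap 5th/95th percentiles: {np.percentile(needs, [5, 95])}")
```

With these toy points the fitted exponent is shallow and the extrapolated scale for 50% accuracy lands beyond 10¹² parameters, mirroring the order-of-magnitude shape of the rebuttal's argument; the real analysis would of course use the paper's measured accuracies.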

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark with observational claims

full rationale

The paper introduces the MATH dataset, reports direct empirical accuracies for Transformer models of varying sizes after pretraining on an auxiliary math corpus, and observes that accuracy remains low even at large scales. No equations, derivations, or fitted functional forms are presented that reduce by construction to the paper's own inputs or self-citations; the scaling-trend remark is a qualitative extrapolation from measured points rather than a self-referential prediction. The work is self-contained against external benchmarks because its central results consist of reproducible evaluations on a newly released dataset whose problems and solutions are independent of any internal model parameters or prior author theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical dataset and benchmarking paper with no free parameters, axioms, or invented entities required for the central claim.

pith-pipeline@v0.9.0 · 5481 in / 912 out tokens · 44708 ms · 2026-05-10T12:55:34.232843+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.PhiForcing phi_forcing · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    cs.CL 2022-01 accept novelty 9.0

    Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

  2. Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

    cs.CL 2026-04 unverdicted novelty 8.0

    A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.

  3. PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

    q-fin.CP 2026-04 conditional novelty 8.0

    Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

  4. Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

    cs.AI 2026-04 unverdicted novelty 8.0

    User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

  5. SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

    cs.AI 2026-03 conditional novelty 8.0

    SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

  6. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  7. Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 7.0

    FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster ...

  8. Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

    cs.CL 2026-05 unverdicted novelty 7.0

    TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4...

  9. Query-Conditioned Test-Time Self-Training for Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.

  10. AIS: Adaptive Importance Sampling for Quantized RL

    stat.ML 2026-05 unverdicted novelty 7.0

    AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

  11. GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.

  12. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

  13. TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM

    cs.CL 2026-05 unverdicted novelty 7.0

    TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.

  14. BadDLM: Backdooring Diffusion Language Models with Diverse Targets

    cs.CR 2026-05 unverdicted novelty 7.0

    BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.

  15. Test-Time Speculation

    cs.CL 2026-05 unverdicted novelty 7.0

    Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.

  16. Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

  17. AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.

  18. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  19. AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.

  20. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  21. Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.

  22. Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...

  23. When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

  24. Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

    cs.LG 2026-05 unverdicted novelty 7.0

    HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.

  25. Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

    cs.LG 2026-05 unverdicted novelty 7.0

    Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...

  26. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  27. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  28. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  29. Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast

    cs.CL 2026-05 unverdicted novelty 7.0

    FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.

  30. SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials

    cs.AI 2026-04 unverdicted novelty 7.0

    SciEval is a new benchmark of expert-annotated K-12 science lessons for LLM-based automatic evaluation, where zero-shot models perform poorly but fine-tuning yields up to 11% gains.

  31. Can Multimodal Large Language Models Truly Understand Small Objects?

    cs.CV 2026-04 unverdicted novelty 7.0

    Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.

  32. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  33. R²-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

    cs.CL 2026-04 unverdicted novelty 7.0

    R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.

  34. Weak-Link Optimization for Multi-Agent Reasoning and Collaboration

    cs.AI 2026-04 unverdicted novelty 7.0

    WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.

  35. SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

    cs.LG 2026-04 unverdicted novelty 7.0

    SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

  36. Towards Unconstrained Human-Object Interaction

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.

  37. Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

  38. TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

    cs.CL 2026-04 unverdicted novelty 7.0

    TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.

  39. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  40. SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

    cs.AI 2026-04 unverdicted novelty 7.0

    SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.

  41. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  42. S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

    cs.CL 2026-04 conditional novelty 7.0

    S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.

  43. MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

    cs.CL 2026-04 conditional novelty 7.0

    Math-PT provides 1,729 native Portuguese math problems and shows frontier LLMs perform well on multiple-choice but drop on figures and open-ended items.

  44. RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025)

    cs.CY 2026-03 unverdicted novelty 7.0

    RoMathExam supplies a century-long collection of Romanian math exams together with a new intrinsic complexity metric that correlates across frontier models at r > 0.72.

  45. Robust Reasoning Benchmark

    cs.LG 2026-03 unverdicted novelty 7.0

    Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs and sequential solving reveals context pollution in attention mechanisms.

  46. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  47. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  48. Let's Verify Step by Step

    cs.LG 2023-05 accept novelty 7.0

    Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.

  49. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    cs.CL 2022-11 unverdicted novelty 7.0

    PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

  50. PreFT: Prefill-only finetuning for efficient inference

    cs.LG 2026-05 accept novelty 6.0

    Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.

  51. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  52. Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

    cs.LG 2026-05 unverdicted novelty 6.0

    Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.

  53. Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs

    cs.CV 2026-05 accept novelty 6.0

    A 30-token prompt requesting a neutral comparison table cuts sponsored recommendations in LLMs from roughly 50% to near zero.

  54. Search Your Block Floating Point Scales!

    cs.LG 2026-05 unverdicted novelty 6.0

    ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

  55. Scalable Token-Level Hallucination Detection in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...

  56. Hölder Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 6.0

    HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.

  57. Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

    cs.CL 2026-05 unverdicted novelty 6.0

    Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.

  58. Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

  59. SOMA: Efficient Multi-turn LLM Serving via Small Language Model

    cs.CL 2026-05 unverdicted novelty 6.0

    SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.

  60. Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

    cs.LG 2026-05 unverdicted novelty 6.0

    SPEX accelerates Tree-of-Thought LLM reasoning 1.2-3x via speculative path selection, dynamic budget allocation across queries, and adaptive early termination, with up to 4.1x when combined with token speculative decoding.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 174 Pith papers

  1. [1]

    rationales are noisy, incomplete and sometimes incorrect

    and claims AQuA-RAT's “rationales are noisy, incomplete and sometimes incorrect.” MathQA then cleans AQuA-RAT, though cleaning reduced the dataset size by half an order of magnitude. Miao et al. (2020) analyze MathQA and observe “the annotated formulas of 27% of the problems do not match their labeled answers,” and they obtain 86% accuracy on ...

  2. [2]

    a, a, z, c, y, e, x, _

    models of various sizes. While enormous Transformers perform poorly on MATH, they do well on other logic and intelligence tests. We analyze Transformers on LogiQA (Liu et al., 2020), a task with logical reasoning questions such as “David knows Mr. Zhang’s friend Jack, and Jack knows David’s friend Ms. Lin. Everyone of them who knows Jack has a master’s de...

  3. [3]

    A 6-sided die is weighted so that the probability of any number being rolled is proportional to the value of the roll. (So, for example, the probability of a 2 being rolled is twice that of a 1 being rolled.) What is the expected value of a roll of this weighted die? Express your answer as a common fraction

  4. [4]

    The square of what other number is 225?

    The square of 15 is 225. The square of what other number is 225?

  5. [5]

    Find the sum of all values of x such that |x − 1| = 7

  6. [6]

    What is c − a? Express your answer as a common fraction

    The parabolas defined by the equations y = −x² − x + 1 and y = 2x² − 1 intersect at points (a, b) and (c, d), where c ≥ a. What is c − a? Express your answer as a common fraction

  7. [7]

    If a = 8, what is the value of (16∛(a²))^(1/3)?

  8. [8]

    Find p(7)

    Let p(x) be a cubic polynomial such that p(2) = 0, p(−1) = 0, p(4) = 6, and p(5) = 8. Find p(7)

  9. [9]

    We say that z ∈ S is a unit if there exists a w ∈ S such that zw = 1

    Let S be the set of complex numbers of the form a + bi, where a and b are integers. We say that z ∈ S is a unit if there exists a w ∈ S such that zw = 1. Find the number of units in S

  10. [10]

    Find the remainder when 1 + 2 + 2² + 2³ + ⋯ + 2¹⁰⁰ is divided by 7

  11. [11]

    If the perimeter of the rectangle is 76 feet, how many square feet are in the area of the rectangle?

    The length of a rectangle is 3x + 10 feet and its width is x + 12 feet. If the perimeter of the rectangle is 76 feet, how many square feet are in the area of the rectangle?

  12. [12]

    Four of the seats are broken

    A European train compartment has six seats. Four of the seats are broken. Wilhelm needs to fill out a form to indicate that there are broken seats. If he randomly checks off four of the seats in the diagram, what is the probability that he marked the correct seats? Express your answer as a common fraction

  13. [13]

    Let M be the midpoint of AB

    We have a triangle △ABC where AC = 17, BC = 15, and AB = 8. Let M be the midpoint of AB. What is the length of CM?

  14. [14]

    Subject accuracy vs problem length

    If n gives a remainder of 3 when divided by 7, then what remainder does 2n + 1 give when divided by 7? [Figure: subject accuracy vs. average problem length in characters, Precalculus Level 1 panel shown; each point represents a subject at a specific difficulty level. We exclude problems...

  15. [15]

    In how many ways can we choose the officers, if individual members are allowed to hold 2, but not all 3, offices?

    Our club has 25 members, and wishes to pick a president, secretary, and treasurer. In how many ways can we choose the officers, if individual members are allowed to hold 2, but not all 3, offices?

  16. [16]

    Find the minimum possible value of √(58 − 42x) + √(149 − 140√(1 − x²)), where −1 ≤ x ≤ 1?

  17. [17]

    Find a + b + c

    Let a, b, and c be the roots of x³ + 7x² − 11x − 2 = 0. Find a + b + c

  18. [18]

    Given that H and C intersect at four points, what is the area of the quadrilateral formed by the four points?

    Let H be the hyperbola with foci at (±5, 0) and vertices at (±3, 0), and let C be the circle with center (0, 0) and radius 4. Given that H and C intersect at four points, what is the area of the quadrilateral formed by the four points?

  19. [19]

    If f(x) = x² − 2x + 1 and g(x) = √(2x + 1), what is the value of f(g(4)) − g(f(3))?

  20. [20]

    Find the value of r such that (6r² − 19r − 7)/(2r − 7) = 4r − 3

  21. [21]

    What is the value of x?

    For x > 0, the area of the triangle with vertices (0, 0), (x, 0), and (x, 5) is 30 square units. What is the value of x?

  22. [22]

    at least one

    Find the units digit of the following within the indicated number base: 413₆ − 215₆. B Checklist Information. Legal Compliance. We create and collect various mathematics problems to create MATH and AMPS. AMPS consists of problems generated with Mathematica and Khan Academy code. Mathematica serves as a calculator and does not copyright its numerical answer ...