pith. machine review for the scientific record.

arxiv: 2103.03874 · v2 · submitted 2021-03-05 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:55 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords MATH dataset · mathematical problem solving · transformer models · scaling laws · competition mathematics · machine learning benchmarks

The pith

The MATH dataset shows that scaling up Transformer models is insufficient for strong mathematical reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the MATH dataset to measure how well AI models can solve challenging competition-level math problems. It provides 12,500 problems, each with a full step-by-step solution, plus an auxiliary pretraining dataset to help models learn mathematical fundamentals. Testing reveals that even very large models achieve only low accuracy, and the observed trends suggest that simply making models bigger or using more compute will not yield high performance. The work argues that new algorithmic advances are needed beyond current scaling approaches.

Core claim

We introduce MATH, a dataset of 12,500 challenging competition mathematics problems with full step-by-step solutions. Despite increasing accuracy with larger models and pretraining, accuracy remains relatively low even with enormous Transformers, and scaling trends indicate it will be impractical to achieve strong mathematical reasoning without new algorithmic changes.

What carries the argument

The MATH dataset: 12,500 competition mathematics problems, each paired with a detailed step-by-step solution, used to evaluate model performance on mathematical problem solving.
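A minimal sketch of how final-answer accuracy on such a benchmark can be scored. The record fields (`answer`, `prediction`) and the naive normalization are illustrative assumptions, not the MATH release's actual schema or grader:

```python
# Hedged sketch: exact-match final-answer scoring over toy records.
# Field names ("answer", "prediction") are illustrative, not the MATH
# release's actual schema.

def normalize(ans: str) -> str:
    """Light normalization: strip whitespace and surrounding $ signs."""
    return ans.strip().strip("$").replace(" ", "")

def accuracy(records: list) -> float:
    """Fraction of records whose prediction exactly matches the gold answer."""
    hits = sum(normalize(r["prediction"]) == normalize(r["answer"]) for r in records)
    return hits / len(records)

records = [
    {"answer": "13/3", "prediction": "13/3"},  # exact match
    {"answer": "7", "prediction": "$7$"},      # match after normalization
    {"answer": "1/2", "prediction": "0.5"},    # equivalent, but counted wrong
]
print(accuracy(records))  # 0.6666666666666666
```

Note the third record: without symbolic equivalence checking, "0.5" never matches "1/2", which is one way exact-match scoring can understate model ability.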

If this is right

  • Current scaling of model size and compute will not suffice to solve advanced math problems effectively.
  • New algorithmic innovations from the research community will be necessary for progress in mathematical reasoning.
  • Models trained on the auxiliary pretraining dataset can improve but still fall short on MATH.
  • Step-by-step solutions in the dataset can be used to train models to generate explanations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Progress on MATH may require techniques that go beyond pattern matching in large datasets, such as symbolic reasoning or verification methods.
  • If scaling continues to underperform on MATH, it could indicate limitations in how Transformers process mathematical structures compared to other tasks.
  • Future benchmarks might need to incorporate more diverse or harder problems to track true advances in reasoning.

Load-bearing premise

That the MATH problems are a faithful and comprehensive measure of general mathematical problem-solving ability and that the observed performance trends with model scale will continue without new algorithmic changes.

What would settle it

Demonstrating a Transformer-based model that achieves high accuracy on the MATH dataset through scaling alone, without novel algorithms, would falsify the claim that scaling is impractical for strong mathematical reasoning.

read the original abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the MATH dataset of 12,500 competition-level mathematics problems, each with a full step-by-step solution, together with a large auxiliary pretraining corpus of mathematical text. It evaluates a range of Transformer models on MATH, reports that final-answer accuracy remains low even for the largest models tested, and concludes that continued scaling of model size and compute will be insufficient to reach strong mathematical reasoning performance if current trends persist, thereby calling for new algorithmic advances.

Significance. If the empirical measurements hold, the work supplies a demanding, well-documented benchmark that exposes clear limitations of pure scaling for mathematical reasoning, a domain where progress has lagged behind other text tasks. The public release of both MATH and the auxiliary pretraining data constitutes a concrete, reusable resource that can accelerate follow-on research; the scaling observations, while subject to the extrapolation concern below, provide a useful baseline for future comparisons.

major comments (1)
  1. [Abstract and scaling-results section] The central claim that 'simply increasing budgets and model parameter counts will be impractical … if scaling trends continue' depends on extrapolating the observed accuracy-versus-size relationship beyond the tested range. The manuscript does not specify the functional form fitted to the data, does not report confidence intervals or cross-validation of that form, and does not examine whether a change in exponent or the onset of saturation would alter the impracticality conclusion while leaving the raw accuracy numbers unchanged.
minor comments (2)
  1. [Evaluation setup] The evaluation protocol should explicitly state whether models are assessed only on final-answer correctness or also on the correctness of the generated step-by-step derivations; the current description leaves this ambiguous.
  2. [Results figures] Table or figure captions for the scaling plots should include the exact model sizes, training budgets, and number of runs used to generate each point.
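On the first minor comment: MATH solutions mark the final answer with \boxed{...}, so final-answer-only grading reduces to extracting and comparing that span. A simplified regex-based extractor, an illustration rather than the paper's released grader:

```python
import re

def extract_boxed(solution: str):
    """Return the contents of the last \\boxed{...} in a solution string,
    handling one level of nested braces; a simplified stand-in for a
    full LaTeX-aware extractor."""
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", solution)
    return matches[-1] if matches else None

sol = r"The expected value is $\boxed{\frac{13}{3}}$."
print(extract_boxed(sol))  # \frac{13}{3}
```

Grading the generated step-by-step derivation itself, as the comment asks about, would require a separate protocol (e.g. human or model judging), which this extractor does not address.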

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive comments and recommendation. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract and scaling-results section] The central claim that 'simply increasing budgets and model parameter counts will be impractical … if scaling trends continue' depends on extrapolating the observed accuracy-versus-size relationship beyond the tested range. The manuscript does not specify the functional form fitted to the data, does not report confidence intervals or cross-validation of that form, and does not examine whether a change in exponent or the onset of saturation would alter the impracticality conclusion while leaving the raw accuracy numbers unchanged.

    Authors: We agree that the extrapolation underlying the claim would be strengthened by greater statistical rigor. The original manuscript presents the scaling results via a figure of accuracy versus model size (parameter count) for a range of Transformer models and notes the slow observed trend, but does not explicitly state a functional form, report fit statistics, or conduct sensitivity checks. In the revision we will add the following: (1) we model the relationship as a power law via ordinary least-squares linear regression on log-log axes and report the fitted exponent, intercept, and R²; (2) we supply bootstrap confidence intervals on the fitted parameters and on the extrapolated accuracies at larger scales; (3) we include a sensitivity analysis that varies the exponent by ±25% around the fitted value and considers an earlier onset of saturation. Even under the most optimistic of these variants, the model sizes required to reach, for example, 50% accuracy remain on the order of 10¹²–10¹³ parameters, well beyond practical limits. These additions will be placed in the scaling-results section and referenced from the abstract; the raw accuracy numbers and the qualitative conclusion that scaling alone is insufficient are unchanged. revision: yes
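The revision plan above (log-log OLS fit, bootstrap intervals, extrapolation to a target accuracy) can be sketched as follows. The accuracy points are synthetic and illustrative; none of these numbers are the paper's measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (parameter count, accuracy %) points -- NOT the paper's data.
params = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
acc = np.array([3.0, 3.9, 5.0, 6.4, 8.2])

# Power law acc = c * params**alpha, i.e. a line on log-log axes,
# fitted by ordinary least squares as the rebuttal describes.
logp, loga = np.log10(params), np.log10(acc)
alpha, logc = np.polyfit(logp, loga, 1)

def extrapolate_params(target_acc, alpha, logc):
    """Parameter count at which the fitted power law reaches target_acc."""
    return 10 ** ((np.log10(target_acc) - logc) / alpha)

# Bootstrap confidence interval on the scale required for 50% accuracy.
needs = []
for _ in range(1000):
    idx = rng.integers(0, len(params), len(params))
    if np.unique(logp[idx]).size < 2:
        continue  # skip degenerate resamples with a single distinct size
    a, c = np.polyfit(logp[idx], loga[idx], 1)
    needs.append(extrapolate_params(50.0, a, c))

print(f"fitted exponent alpha = {alpha:.3f}")
print(f"params needed for 50% accuracy ~ {extrapolate_params(50.0, alpha, logc):.2e}")
print(f"bootstrap 5th/95th percentiles: {np.percentile(needs, [5, 95])}")
```

With these toy points the fitted exponent is shallow and the extrapolated scale for 50% accuracy lands beyond 10¹² parameters, mirroring the order-of-magnitude shape of the rebuttal's argument; the real analysis would of course use the paper's measured accuracies.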

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark with observational claims

full rationale

The paper introduces the MATH dataset, reports direct empirical accuracies for Transformer models of varying sizes after pretraining on an auxiliary math corpus, and observes that accuracy remains low even at large scales. No equations, derivations, or fitted functional forms are presented that reduce by construction to the paper's own inputs or self-citations; the scaling-trend remark is a qualitative extrapolation from measured points rather than a self-referential prediction. The work is self-contained against external benchmarks because its central results consist of reproducible evaluations on a newly released dataset whose problems and solutions are independent of any internal model parameters or prior author theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical dataset and benchmarking paper with no free parameters, axioms, or invented entities required for the central claim.

pith-pipeline@v0.9.0 · 5481 in / 912 out tokens · 44708 ms · 2026-05-10T12:55:34.232843+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.PhiForcing phi_forcing · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    cs.CL 2022-01 accept novelty 9.0

    Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

  2. Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

    cs.CL 2026-04 unverdicted novelty 8.0

    A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.

  3. PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

    q-fin.CP 2026-04 conditional novelty 8.0

    Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

  4. Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

    cs.AI 2026-04 unverdicted novelty 8.0

    User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

  5. SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

    cs.AI 2026-03 conditional novelty 8.0

    SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

  6. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  7. Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 7.0

    FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster ...

  8. Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

    cs.CL 2026-05 unverdicted novelty 7.0

    TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4...

  9. Query-Conditioned Test-Time Self-Training for Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.

  10. AIS: Adaptive Importance Sampling for Quantized RL

    stat.ML 2026-05 unverdicted novelty 7.0

    AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

  11. GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.

  12. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

  13. TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM

    cs.CL 2026-05 unverdicted novelty 7.0

    TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.

  14. BadDLM: Backdooring Diffusion Language Models with Diverse Targets

    cs.CR 2026-05 unverdicted novelty 7.0

    BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.

  15. Test-Time Speculation

    cs.CL 2026-05 unverdicted novelty 7.0

    Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.

  16. Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

  17. AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.

  18. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  19. AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.

  20. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  21. Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.

  22. Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...

  23. When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

  24. Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

    cs.LG 2026-05 unverdicted novelty 7.0

    HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.

  25. Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

    cs.LG 2026-05 unverdicted novelty 7.0

    Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...

  26. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  27. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  28. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  29. Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast

    cs.CL 2026-05 unverdicted novelty 7.0

    FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.

  30. SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials

    cs.AI 2026-04 unverdicted novelty 7.0

    SciEval is a new benchmark of expert-annotated K-12 science lessons for LLM-based automatic evaluation, where zero-shot models perform poorly but fine-tuning yields up to 11% gains.

  31. Can Multimodal Large Language Models Truly Understand Small Objects?

    cs.CV 2026-04 unverdicted novelty 7.0

    Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.

  32. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  33. R²-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

    cs.CL 2026-04 unverdicted novelty 7.0

    R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.

  34. Weak-Link Optimization for Multi-Agent Reasoning and Collaboration

    cs.AI 2026-04 unverdicted novelty 7.0

    WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.

  35. SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

    cs.LG 2026-04 unverdicted novelty 7.0

    SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

  36. Towards Unconstrained Human-Object Interaction

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.

  37. Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

  38. TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

    cs.CL 2026-04 unverdicted novelty 7.0

    TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.

  39. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  40. SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

    cs.AI 2026-04 unverdicted novelty 7.0

    SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.

  41. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  42. S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

    cs.CL 2026-04 conditional novelty 7.0

    S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.

  43. MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

    cs.CL 2026-04 conditional novelty 7.0

    Math-PT provides 1,729 native Portuguese math problems and shows frontier LLMs perform well on multiple-choice but drop on figures and open-ended items.

  44. RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025)

    cs.CY 2026-03 unverdicted novelty 7.0

    RoMathExam supplies a century-long collection of Romanian math exams together with a new intrinsic complexity metric that correlates across frontier models at r > 0.72.

  45. Robust Reasoning Benchmark

    cs.LG 2026-03 unverdicted novelty 7.0

    Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs and sequential solving reveals context pollution in attention mechanisms.

  46. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  47. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  48. Let's Verify Step by Step

    cs.LG 2023-05 accept novelty 7.0

    Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.

  49. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    cs.CL 2022-11 unverdicted novelty 7.0

    PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

  50. PreFT: Prefill-only finetuning for efficient inference

    cs.LG 2026-05 accept novelty 6.0

    Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.

  51. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  52. Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

    cs.LG 2026-05 unverdicted novelty 6.0

    Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.

  53. Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs

    cs.CV 2026-05 accept novelty 6.0

    A 30-token prompt requesting a neutral comparison table cuts sponsored recommendations in LLMs from roughly 50% to near zero.

  54. Search Your Block Floating Point Scales!

    cs.LG 2026-05 unverdicted novelty 6.0

    ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

  55. Scalable Token-Level Hallucination Detection in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...

  56. Hölder Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 6.0

    HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.

  57. Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

    cs.CL 2026-05 unverdicted novelty 6.0

    Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.

  58. Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

  59. SOMA: Efficient Multi-turn LLM Serving via Small Language Model

    cs.CL 2026-05 unverdicted novelty 6.0

    SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.

  60. Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

    cs.LG 2026-05 unverdicted novelty 6.0

    SPEX accelerates Tree-of-Thought LLM reasoning 1.2-3x via speculative path selection, dynamic budget allocation across queries, and adaptive early termination, with up to 4.1x when combined with token speculative decoding.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 174 Pith papers

  1. [1]

    rationales are noisy, incomplete and sometimes incorrect

    and claims AQuA-RAT's “rationales are noisy, incomplete and sometimes incorrect.” MathQA then cleans AQuA-RAT, though cleaning reduced the dataset size by half an order of magnitude. Miao et al. (2020) analyze MathQA and observe “the annotated formulas of 27% of the problems do not match their labeled answers,” and they obtain 86% accuracy on ...

  2. [2]

    a, a, z, c, y, e, x, _

    models of various sizes. While enormous Transformers perform poorly on MATH, they do well on other logic and intelligence tests. We analyze Transformers on LogiQA (Liu et al., 2020), a task with logical reasoning questions such as “David knows Mr. Zhang’s friend Jack, and Jack knows David’s friend Ms. Lin. Everyone of them who knows Jack has a master’s de...

  3. [3]

    A 6-sided die is weighted so that the probability of any number being rolled is proportional to the value of the roll. (So, for example, the probability of a 2 being rolled is twice that of a 1 being rolled.) What is the expected value of a roll of this weighted die? Express your answer as a common fraction

  4. [4]

    The square of what other number is 225?

    The square of 15 is 225. The square of what other number is 225?

  5. [5]

    Find the sum of all values of x such that |x − 1| = 7

  6. [6]

    What is c − a? Express your answer as a common fraction

    The parabolas defined by the equations y = −x² − x + 1 and y = 2x² − 1 intersect at points (a, b) and (c, d), where c ≥ a. What is c − a? Express your answer as a common fraction

  7. [7]

    If a = 8, what is the value of (16∛(a²))^(1/3)?

  8. [8]

    Find p(7)

    Let p(x) be a cubic polynomial such that p(2) = 0, p(−1) = 0, p(4) = 6, and p(5) = 8. Find p(7)

  9. [9]

    We say that z ∈ S is a unit if there exists a w ∈ S such that zw = 1

    Let S be the set of complex numbers of the form a + bi, where a and b are integers. We say that z ∈ S is a unit if there exists a w ∈ S such that zw = 1. Find the number of units in S

  10. [10]

    Find the remainder when 1 + 2 + 2² + 2³ + ⋯ + 2¹⁰⁰ is divided by 7

  11. [11]

    If the perimeter of the rectangle is 76 feet, how many square feet are in the area of the rectangle?

    The length of a rectangle is 3x + 10 feet and its width is x + 12 feet. If the perimeter of the rectangle is 76 feet, how many square feet are in the area of the rectangle?

  12. [12]

    Four of the seats are broken

    A European train compartment has six seats. Four of the seats are broken. Wilhelm needs to fill out a form to indicate that there are broken seats. If he randomly checks off four of the seats in the diagram, what is the probability that he marked the correct seats? Express your answer as a common fraction

  13. [13]

    Let M be the midpoint of AB

    We have a triangle △ABC where AC = 17, BC = 15, and AB = 8. Let M be the midpoint of AB. What is the length of CM?

  14. [14]

    Subject accuracy vs problem length

    If n gives a remainder of 3 when divided by 7, then what remainder does 2n + 1 give when divided by 7? [Figure: subject accuracy vs. average problem length in characters, Precalculus Level 1 panel shown; each point represents a subject at a specific difficulty level. We exclude problems...

  15. [15]

    In how many ways can we choose the officers, if individual members are allowed to hold 2, but not all 3, offices?

    Our club has 25 members, and wishes to pick a president, secretary, and treasurer. In how many ways can we choose the officers, if individual members are allowed to hold 2, but not all 3, offices?

  16. [16]

    Find the minimum possible value of √(58 − 42x) + √(149 − 140√(1 − x²)), where −1 ≤ x ≤ 1?

  17. [17]

    Find a + b + c

    Let a, b, and c be the roots of x³ + 7x² − 11x − 2 = 0. Find a + b + c

  18. [18]

    Given that H and C intersect at four points, what is the area of the quadrilateral formed by the four points?

    Let H be the hyperbola with foci at (±5, 0) and vertices at (±3, 0), and let C be the circle with center (0, 0) and radius 4. Given that H and C intersect at four points, what is the area of the quadrilateral formed by the four points?

  19. [19]

    If f(x) = x² − 2x + 1 and g(x) = √(2x + 1), what is the value of f(g(4)) − g(f(3))?

  20. [20]

    Find the value of r such that (6r² − 19r − 7)/(2r − 7) = 4r − 3

  21. [21]

    What is the value of x?

    For x > 0, the area of the triangle with vertices (0, 0), (x, 0), and (x, 5) is 30 square units. What is the value of x?

  22. [22]

    at least one

    Find the units digit of the following within the indicated number base: 413₆ − 215₆. B Checklist Information. Legal Compliance. We create and collect various mathematics problems to create MATH and AMPS. AMPS consists of problems generated with Mathematica and Khan Academy code. Mathematica serves as a calculator and does not copyright its numerical answer ...