
arxiv: 2409.12122 · v1 · submitted 2024-09-18 · 💻 cs.CL · cs.AI · cs.LG

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

Pith reviewed 2026-05-11 10:10 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords mathematical reasoning · self-improvement · large language models · reward model · supervised fine-tuning · reinforcement learning · chain-of-thought · tool-integrated reasoning

The pith

Integrating self-improvement across pre-training, post-training, and inference produces math-specialized models with stronger reasoning on competition problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The report describes a series of Qwen2.5-Math models built by embedding self-improvement into the full development pipeline. An earlier model, Qwen2-Math-Instruct, generates large-scale math data for pre-training. A reward model trained on massive samples from that model then filters supervised fine-tuning data across successive rounds, and each improved model is used in turn to train a better reward model for the next round. The final supervised model undergoes reinforcement learning guided by the final reward model, and the same reward model guides sampling at inference time to select better outputs. If successful, this loop offers a path to domain-specialized models that generate and refine their own higher-quality training data without heavy reliance on external curation. Readers would care because it tests whether iterative filtering can steadily advance performance on tasks ranging from grade-school arithmetic to advanced competition problems in both English and Chinese.

Core claim

The authors claim that a closed self-improvement loop (using the current model to generate data, scoring it with a reward model trained on earlier samples, and retraining) yields progressive gains in mathematical capability. This cycle runs through pre-training data creation, multiple rounds of supervised fine-tuning data evolution, reinforcement learning, and reward-guided inference, resulting in models that handle both chain-of-thought and tool-integrated reasoning on English and Chinese math benchmarks.
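The control flow of such a loop can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: the function name, the callables `sample`, `train_sft`, and `train_rm`, the fixed score threshold, and the integer "policy" used in the stand-ins are all hypothetical.

```python
from typing import Any, Callable, List, Tuple


def self_improvement_rounds(
    policy: Any,
    reward_model: Callable[[str, str], float],
    problems: List[str],
    sample: Callable[[Any, str, int], List[str]],
    train_sft: Callable[[Any, List[Tuple[str, str]]], Any],
    train_rm: Callable[[Any, List[str]], Callable[[str, str], float]],
    rounds: int = 2,
    n_samples: int = 4,
    threshold: float = 0.5,
):
    """Alternate between RM-filtered SFT data collection and RM refresh."""
    history = []  # number of kept (problem, solution) pairs per round
    for _ in range(rounds):
        kept = []
        for problem in problems:
            for candidate in sample(policy, problem, n_samples):
                # Keep only candidates the current reward model scores highly.
                if reward_model(problem, candidate) >= threshold:
                    kept.append((problem, candidate))
        policy = train_sft(policy, kept)            # stronger SFT model
        reward_model = train_rm(policy, problems)   # refreshed reward model
        history.append(len(kept))
    return policy, reward_model, history


# Hypothetical stand-ins: the "policy" is an int giving how many of the
# n samples per problem come out correct.
def toy_sample(policy, problem, n):
    return ["correct"] * min(policy, n) + ["wrong"] * max(n - policy, 0)


def toy_reward(problem, candidate):
    return 1.0 if candidate == "correct" else 0.0


def toy_train_sft(policy, kept):
    return policy + 1 if kept else policy


def toy_train_rm(policy, problems):
    return toy_reward
```

With the toy stand-ins, the number of kept samples grows from round to round only because the stand-in trainer is defined to improve; whether real training behaves this way is exactly what the report's claim depends on.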

What carries the argument

The reward model obtained from massive sampling of model outputs, which filters data for iterative supervised fine-tuning, guides reinforcement learning, and steers inference sampling.

If this is right

  • The final models support both chain-of-thought and tool-integrated reasoning on grade-school to competition-level problems.
  • Iterative reward-model updates allow each stronger supervised fine-tuning model to train an improved reward model for the next cycle.
  • Reinforcement learning on the final supervised model uses the ultimate reward model to further refine outputs.
  • Reward-guided sampling at inference time improves answer quality on the evaluated English and Chinese datasets.
  • The approach covers both Chinese and English mathematical reasoning across ten benchmarks of varying difficulty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sampling-plus-filtering loop might be applied to other specialized domains if a reliable reward model can be built for those domains.
  • Success would imply that models can bootstrap expertise in reasoning-heavy fields with less dependence on human-labeled data.
  • If the cycle can be sustained without mode collapse, it raises the possibility of continued performance gains through repeated self-refinement rounds.
  • The bilingual capability suggests the method preserves or enhances cross-lingual transfer when data generation and filtering are applied to mixed-language corpora.

Load-bearing premise

Repeated sampling plus reward-model filtering produces steadily higher-quality math data without compounding errors or narrowing the model's output distribution.
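One simple way to watch for a narrowing output distribution is to track a diversity proxy on the sampled solutions from round to round. The sketch below uses distinct-n (the fraction of unique word n-grams), a common proxy that is swapped in here for illustration; the report does not state that it uses this metric, and the whitespace tokenization is an assumption.

```python
from typing import Iterable


def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Fraction of unique word n-grams among all n-grams in `texts`.

    Values near 1.0 indicate varied outputs; values near 0.0 indicate
    heavy repetition across samples.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

A steady drop in this value across rounds would be a warning sign, though it would not by itself show that answer quality had declined.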

What would settle it

A controlled experiment that removes the iterative reward-model data-evolution steps would settle it. If the full pipeline shows no gain, or a loss, relative to the ablated models on held-out competition benchmarks such as AIME24 or MATH, the claim that the self-improvement pipeline drives the observed performance would be falsified.

Original abstract

In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it's possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the ultimate RM for reinforcement learning, resulting in the Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model's performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade school level to math competition problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents the Qwen2.5-Math series (1.5B/7B/72B) whose core contribution is integrating self-improvement across the full pipeline: pre-training data generation from Qwen2-Math-Instruct, post-training iterative RM training from model samples followed by SFT data evolution and RL, and inference-time RM-guided sampling. The models are evaluated on ten English and Chinese math benchmarks spanning grade-school to competition level (GSM8K, MATH, AIME24, etc.).

Significance. If the iterative self-improvement loop demonstrably improves data quality without compounding errors, the approach would provide a scalable, largely automated route to stronger mathematical reasoning models and reduce reliance on human-curated corpora. The manuscript currently supplies only end-to-end benchmark numbers, so the practical significance cannot yet be assessed.

major comments (3)
  1. [Post-training phase] Post-training phase (abstract and §3): the central claim that iterative RM-SFT evolution produces net gains in data quality rests on the untested assumption that repeated sampling plus RM filtering avoids distributional shift or reward hacking. No per-iteration quality metrics, error rates on generated traces, or control runs that hold total tokens fixed while disabling iteration are reported.
  2. [Abstract and post-training] Abstract and post-training description: the RM is first trained on samples from Qwen2-Math-Instruct and then used to filter the next SFT round, creating an explicit circular dependency; without reported RM accuracy on held-out human data, agreement statistics, or analysis of mode collapse, it is impossible to verify that the loop improves rather than reinforces the base model's limitations.
  3. [Evaluation] Evaluation section: only aggregate benchmark scores are given for the final models. The absence of ablation tables isolating the iterative component, error bars across multiple seeds, or comparisons against single-round synthesis with matched data volume leaves the source of any observed gains (self-improvement vs. scale vs. base model) unidentified.
minor comments (2)
  1. [Abstract] The abstract states that ten datasets are used but does not enumerate them; a compact table or appendix reference would aid reproducibility.
  2. [Inference stage] Inference-stage RM guidance is mentioned but the precise algorithm (best-of-N, process supervision, etc.) and hyper-parameters are not specified.
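For concreteness, the simplest form of RM-guided inference mentioned above is best-of-N selection: sample N candidates and return the one the reward model scores highest. The sketch below illustrates that one option only; the source does not specify which algorithm the models use, and `best_of_n` and `reward_fn` are hypothetical names.

```python
from typing import Callable, Sequence


def best_of_n(
    problem: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest reward-model score.

    `reward_fn(problem, candidate)` stands in for a trained reward model.
    Ties go to the earliest candidate, since `max` keeps the first maximum.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: reward_fn(problem, c))
```

Under this scheme the number of candidates N is the main hyper-parameter, so the final answer can only be as good as the best sample the policy produces and the reward model's ability to recognize it.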

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback on our technical report. The points raised are important for substantiating the self-improvement claims, and we respond to each below. Revisions have been made to the manuscript to provide greater transparency on the post-training pipeline and evaluation.

Point-by-point responses
  1. Referee: [Post-training phase] Post-training phase (abstract and §3): the central claim that iterative RM-SFT evolution produces net gains in data quality rests on the untested assumption that repeated sampling plus RM filtering avoids distributional shift or reward hacking. No per-iteration quality metrics, error rates on generated traces, or control runs that hold total tokens fixed while disabling iteration are reported.

    Authors: We agree that explicit per-iteration metrics and control experiments would provide stronger support for our claims. In the revised manuscript, we have added per-iteration quality metrics in Section 3, including the evolution of average RM scores and the percentage of samples passing the reward threshold. We also include a qualitative error analysis on generated mathematical traces. For the control runs with matched token counts, we note this as a limitation due to computational constraints and have added a discussion in the limitations section. A partial comparison to non-iterative data synthesis is provided using available resources.
    revision: partial

  2. Referee: [Abstract and post-training] Abstract and post-training description: the RM is first trained on samples from Qwen2-Math-Instruct and then used to filter the next SFT round, creating an explicit circular dependency; without reported RM accuracy on held-out human data, agreement statistics, or analysis of mode collapse, it is impossible to verify that the loop improves rather than reinforces the base model's limitations.

    Authors: The design intentionally uses iterative updates to the RM to leverage improving model capabilities and reduce the risk of reinforcing initial limitations. We have revised the post-training section to include the RM accuracy on held-out human-annotated data, along with inter-rater agreement statistics between the RM and human evaluators. Additionally, we have added an analysis of response diversity across iterations to address potential mode collapse. These revisions allow for better verification that the process leads to net improvements.
    revision: yes

  3. Referee: [Evaluation] Evaluation section: only aggregate benchmark scores are given for the final models. The absence of ablation tables isolating the iterative component, error bars across multiple seeds, or comparisons against single-round synthesis with matched data volume leaves the source of any observed gains (self-improvement vs. scale vs. base model) unidentified.

    Authors: We concur that ablations are necessary to attribute the gains. The revised evaluation section now features an ablation study isolating the iterative self-improvement component, with comparisons to single-round SFT/RL using similar data volumes. We have also included error bars representing standard deviation over multiple evaluation runs on the main benchmarks. While full multi-seed training was not feasible, these additions help identify the contributions of self-improvement.
    revision: partial
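The quantities named in these responses (average RM score per round, share of samples passing a reward threshold, and error bars over repeated evaluation runs) are cheap to compute. The sketch below shows one plausible way to do so; the function names, the threshold value, and the use of the sample standard deviation are assumptions, and no numbers from the paper are implied.

```python
from statistics import mean, stdev
from typing import Dict, Sequence, Tuple


def round_summary(scores: Sequence[float], threshold: float) -> Dict[str, float]:
    """Summarize one data-evolution round from its reward-model scores."""
    if not scores:
        raise ValueError("need at least one score")
    passed = sum(1 for s in scores if s >= threshold)
    return {"mean_score": mean(scores), "pass_rate": passed / len(scores)}


def mean_and_std(run_accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated evaluation runs.

    With a single run there is no spread to estimate, so 0.0 is returned.
    """
    m = mean(run_accuracies)
    s = stdev(run_accuracies) if len(run_accuracies) > 1 else 0.0
    return m, s
```

Note that variation across evaluation runs of one trained model captures sampling noise only; it says nothing about variation across training seeds, which is the gap the third response acknowledges.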

standing simulated objections not resolved
  • Full control experiments that hold total tokens fixed while disabling iteration were not run, owing to their high computational cost.

Circularity Check

0 steps flagged

No significant circularity in claimed derivation chain

Full rationale

The paper describes an empirical iterative pipeline (pre-training data generation from Qwen2-Math-Instruct, RM training via sampling from the same model, iterative SFT data evolution, RM updates, and final RL) but presents no equations, first-principles derivations, or uniqueness theorems whose outputs reduce to inputs by construction. The self-improvement process is a standard training loop whose net gains are claimed via end-to-end benchmarks rather than tautological redefinitions or fitted parameters renamed as predictions. Self-citations to prior Qwen models exist but are not load-bearing for the central claim, which remains an independently verifiable empirical procedure without self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that self-generated data plus reward-model selection produces net improvement; no independent external benchmarks or formal proofs are referenced in the abstract.

axioms (2)
  • domain assumption: Reward model trained on model-generated samples accurately ranks mathematical correctness
    Invoked when the RM is used to guide SFT data iteration and RL
  • ad hoc to paper: Iterative self-sampling does not introduce compounding distributional shift or reward hacking
    Required for the claim that each round improves the next
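The first axiom can be probed directly when ground-truth correctness labels are available for sampled solutions. A minimal sketch, assuming each candidate for a problem carries an RM score and a verified correct/incorrect label: the metric below is the fraction of (correct, incorrect) pairs in which the correct solution receives the strictly higher score. The function name and input layout are hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple


def pairwise_ranking_accuracy(scored: Sequence[Tuple[float, bool]]) -> float:
    """Share of (correct, incorrect) pairs the reward model orders correctly.

    `scored` holds (rm_score, is_correct) for candidates of ONE problem.
    Ties in score count as failures.
    """
    correct = [score for score, ok in scored if ok]
    wrong = [score for score, ok in scored if not ok]
    pairs = len(correct) * len(wrong)
    if pairs == 0:
        raise ValueError("need at least one correct and one incorrect candidate")
    wins = sum(1 for c, w in product(correct, wrong) if c > w)
    return wins / pairs
```

A value of 0.5 corresponds to chance-level ranking. Averaging this over held-out problems with independently verified answers would give the kind of RM accuracy figure the referee asks for.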

pith-pipeline@v0.9.0 · 5661 in / 1353 out tokens · 41508 ms · 2026-05-11T10:10:18.156385+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

    cs.LG 2026-05 conditional novelty 8.0

    Conformal Selective Acting (CSA) fills a gap in conformal methods by providing per-round, pathwise-valid selective risk bounds for adaptive RLVR LLM streams under predictable updates and isotonic calibration.

  2. FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

    cs.AI 2026-05 conditional novelty 8.0

    FormalRewardBench is the first benchmark for reward models in formal theorem proving, consisting of 250 Lean 4 preference pairs that show frontier LLMs scoring 59.8% while specialized provers score only 24.4%.

  3. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  4. Weak-to-Strong Elicitation via Mismatched Wrong Drafts

    cs.CL 2026-05 conditional novelty 7.0

    Mismatched wrong drafts from a 1.5B math model injected into GRPO training of a 7B model yield higher pass rates on MATH-500 and AIME than on-policy baselines or matched variants.

  5. AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and Agent...

  6. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  7. Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

    cs.LG 2026-05 conditional novelty 7.0

    ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...

  8. Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.

  9. From Noise to Diversity: Random Embedding Injection in LLM Reasoning

    cs.AI 2026-05 conditional novelty 7.0

    Random Soft Prompts (RSPs) sampled from the embedding distribution improve Pass@N on reasoning benchmarks by increasing early-stage token diversity without any training.

  10. Unsupervised Process Reward Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

  11. PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    PlantMarkerBench is a new multi-species benchmark with 5,550 evidence instances for evaluating language models on literature-grounded plant marker gene reasoning across expression, localization, function, indirect, an...

  12. PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    PlantMarkerBench supplies 5,550 literature sentences annotated for plant marker gene evidence validity and type across Arabidopsis, maize, rice and tomato, showing frontier LLMs handle direct expression evidence but s...

  13. Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators

    cs.LG 2026-05 unverdicted novelty 7.0

    CIKA uses LLM-based interventions to probe causal effects of concepts on math reasoning success, achieving competitive results on benchmarks like Omni-MATH and GSM8K with a frozen 7B model.

  14. AI co-mathematician: Accelerating mathematicians with agentic AI

    cs.AI 2026-05 unverdicted novelty 7.0

    An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

  15. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  16. Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    SGAC replaces reward-variance heuristics with a multi-feature learnable selector emphasizing output entropy, yielding 68% accuracy on Hendrycks MATH with Qwen2.5-Math-1.5B versus 64-66% baselines.

  17. E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems

    cs.CR 2026-05 unverdicted novelty 7.0

    E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-bas...

  18. Validity-Calibrated Reasoning Distillation

    cs.LG 2026-04 unverdicted novelty 7.0

    Validity-calibrated reasoning distillation improves small LLMs by using relative local validity of next steps to dynamically adjust imitation strength instead of enforcing full trajectory matching.

  19. Validity-Calibrated Reasoning Distillation

    cs.LG 2026-04 unverdicted novelty 7.0

    Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.

  20. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  21. How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

    cs.CL 2026-03 conditional novelty 7.0

    TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.

  22. PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

    cs.AI 2026-03 conditional novelty 7.0

    PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.

  23. BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models

    cs.IR 2026-01 conditional novelty 7.0

    BEAR adds a beam-search-aware regularization to LLM fine-tuning for recommendations that forces positive-item tokens to rank in the top-B candidates at each decoding step to avoid premature pruning.

  24. Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

    cs.LG 2025-07 unverdicted novelty 7.0

    An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.

  25. Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

    cs.LG 2025-07 unverdicted novelty 7.0

    Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.

  26. Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

    cs.AI 2025-05 unverdicted novelty 7.0

    UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.

  27. Reinforcement Learning for Reasoning in Large Language Models with One Training Example

    cs.LG 2025-04 accept novelty 7.0

    One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

  28. Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

    cs.CL 2024-10 conditional novelty 7.0

    Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.

  29. CLORE: Content-Level Optimization for Reasoning Efficiency

    cs.AI 2026-05 unverdicted novelty 6.0

    CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.

  30. You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

    cs.LG 2026-05 unverdicted novelty 6.0

    RELEX extrapolates LLM checkpoints from short RLVR prefixes by projecting deltas onto a rank-1 subspace and fitting a linear trend, matching full training performance at 15% of the steps.

  31. DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 6.0

    DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains ...

  32. How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

    cs.LG 2026-05 conditional novelty 6.0

    Mu-GRPO enables substantially more off-policy GRPO training for LLMs via relaxed clipping and negative-advantage veto in large staged batches, matching standard GRPO performance at ~2x training speed.

  33. Dynamic Model Merging Made Slim

    cs.LG 2026-05 unverdicted novelty 6.0

    DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.

  34. SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    SAGE reshapes the reverse-KL anchor via guide function q(x,y) for controllable empirical support expansion, yielding gains in both pass@1 and pass@k on math reasoning benchmarks.

  35. PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.

  36. STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

    cs.LG 2026-05 unverdicted novelty 6.0

    STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.

  37. Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  38. Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

    cs.LG 2026-05 unverdicted novelty 6.0

    ConSPO introduces a contrastive sequence-level policy optimization that aligns rollout scores with generation likelihoods via length-normalized log-probabilities and an InfoNCE-style group contrast with curriculum mar...

  39. Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    For a fixed data budget in LLM supervised fine-tuning, optimal data difficulty shifts toward harder examples as the budget grows because of the tradeoff between in-distribution generalization gap and extrapolation gap.

  40. Scalable Token-Level Hallucination Detection in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...

  41. Holder Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 6.0

    HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.

  42. Holder Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 6.0

    HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.

  43. Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving

    cs.AI 2026-05 unverdicted novelty 6.0

    Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.

  44. Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

    cs.CL 2026-05 unverdicted novelty 6.0

    Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.

  45. Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

    cs.LG 2026-05 unverdicted novelty 6.0

    Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.

  46. DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.

  47. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  48. AI co-mathematician: Accelerating mathematicians with agentic AI

    cs.AI 2026-05 unverdicted novelty 6.0

    An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.

  49. Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.

  50. Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

  51. DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

    cs.LG 2026-05 unverdicted novelty 6.0

    DGPO reinterprets distribution deviation as a guiding signal in a critic-free policy optimization framework to enable fine-grained credit assignment for LLM chain-of-thought reasoning.

  52. DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

    cs.LG 2026-05 unverdicted novelty 6.0

    DGPO is a critic-free RL framework that uses bounded Hellinger distance and entropy-gated advantage redistribution to enable fine-grained token-level credit assignment in long CoT generations for LLM alignment, report...

  53. Controllable and Verifiable Process Data Synthesis for Process Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    A controllable synthesis method creates prefix-invalid yet trajectory-consistent process supervision data for training and evaluating process reward models by injecting verifiable errors into symbolic reasoning chains.

  54. Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.

  55. Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

    cs.LG 2026-05 conditional novelty 6.0

    DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.

  56. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  57. Cost-Aware Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.

  58. Select to Think: Unlocking SLM Potential with Local Sufficiency

    cs.CL 2026-04 conditional novelty 6.0

    Small language models can achieve near large-model reasoning performance by learning to re-rank their own top-K token predictions after distilling selection from the large model.

  59. When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

    cs.AI 2026-04 unverdicted novelty 6.0

    A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.

  60. SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

    cs.LG 2026-04 conditional novelty 6.0

    Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 120 Pith papers
