Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
Pith reviewed 2026-05-11 10:10 UTC · model grok-4.3
The pith
Integrating self-improvement across pre-training, post-training, and inference produces math-specialized models with stronger reasoning on competition problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that a closed self-improvement loop—using the current model to generate data, scoring it with a reward model derived from prior samples, and retraining—yields progressive gains in mathematical capability. This cycle runs through pre-training data creation, multiple rounds of supervised fine-tuning data evolution, reinforcement learning, and reward-guided inference, resulting in models that handle both chain-of-thought and tool-integrated reasoning on English and Chinese math benchmarks.
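A minimal sketch of that loop, with hypothetical stand-ins for the sampling, reward-model training, filtering, and fine-tuning stages (the names and the 0.5 threshold are illustrative, not taken from the report):

```python
from typing import Any, Callable, List, Tuple

Model = Any                # placeholder types; the report does not expose these interfaces
RewardModel = Any
Sample = Tuple[str, str]   # (problem, candidate solution)

def self_improvement_round(
    model: Model,
    problems: List[str],
    sample: Callable[[Model, List[str]], List[Sample]],   # massive sampling from the current model
    train_rm: Callable[[List[Sample]], RewardModel],      # reward model derived from those samples
    score: Callable[[RewardModel, Sample], float],
    finetune: Callable[[Model, List[Sample]], Model],
    threshold: float = 0.5,
) -> Tuple[Model, RewardModel]:
    """One hypothetical round of the sample -> score -> filter -> retrain cycle."""
    candidates = sample(model, problems)
    reward_model = train_rm(candidates)
    kept = [c for c in candidates if score(reward_model, c) >= threshold]
    return finetune(model, kept), reward_model   # the stronger model seeds the next round
```

Repeating this round and then running reinforcement learning against the last reward model is the claim in miniature; everything else in the report is about making each stage produce data good enough for the next.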
What carries the argument
A reward model trained on massive samples of the model's own outputs: it filters data for iterative supervised fine-tuning, guides reinforcement learning, and steers inference-time sampling.
If this is right
- The final models support both chain-of-thought and tool-integrated reasoning on grade-school to competition-level problems.
- Iterative reward-model updates allow each stronger supervised fine-tuning model to train an improved reward model for the next cycle (a minimal pairing sketch follows this list).
- Reinforcement learning on the final supervised model uses the ultimate reward model to further refine outputs.
- Reward-guided sampling at inference time improves answer quality on the evaluated English and Chinese datasets.
- The approach covers both Chinese and English mathematical reasoning across ten benchmarks of varying difficulty.
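On the iterative-RM bullet above, one plausible way a stronger SFT model yields training data for the next reward model is to pair its correct and incorrect samples per problem. The final-answer check used as the correctness signal below is an assumption, not a detail confirmed by the report:

```python
from itertools import product
from typing import List, Tuple

def build_preference_pairs(
    problem: str,
    solutions: List[str],
    is_correct: List[bool],   # e.g. from a final-answer check; an assumed signal
) -> List[Tuple[str, str, str]]:
    """Pair every correct sampled solution with every incorrect one for the same
    problem, producing (problem, chosen, rejected) triples that a Bradley-Terry
    style reward model could be trained on in the next cycle."""
    chosen = [s for s, ok in zip(solutions, is_correct) if ok]
    rejected = [s for s, ok in zip(solutions, is_correct) if not ok]
    return [(problem, c, r) for c, r in product(chosen, rejected)]
```

If each round's sampler is stronger, the pool of chosen solutions grows in coverage and difficulty, which is what the iterative-RM bullet depends on.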
Where Pith is reading between the lines
- The same sampling-plus-filtering loop might be applied to other specialized domains if a reliable reward model can be built for those domains.
- Success would imply that models can bootstrap expertise in reasoning-heavy fields with less dependence on human-labeled data.
- If the cycle can be sustained without mode collapse, it raises the possibility of continued performance gains through repeated self-refinement rounds.
- The bilingual capability suggests the method preserves or enhances cross-lingual transfer when data generation and filtering are applied to mixed-language corpora.
Load-bearing premise
Repeated sampling plus reward-model filtering produces steadily higher-quality math data without compounding errors or narrowing the model's output distribution.
What would settle it
A controlled experiment that removes the iterative reward-model data-evolution steps: if the resulting models show no gain, or a loss, on held-out competition benchmarks such as AIME24 or MATH, the claim that the self-improvement pipeline drives the observed performance is falsified.
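A hedged sketch of how that settling experiment could be scored, using a paired bootstrap over per-problem correctness on the held-out benchmark; the function name and resample count are illustrative, not from the report:

```python
import random
from typing import List

def paired_bootstrap_win_rate(
    with_iteration: List[bool],     # per-problem correctness, full self-improvement pipeline
    without_iteration: List[bool],  # per-problem correctness, iterative steps removed
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Fraction of bootstrap resamples in which the full pipeline beats the
    ablation on the same problems. Values near 1.0 would support the claim that
    the iterative steps drive the gains; values near 0.5 would undercut it."""
    assert len(with_iteration) == len(without_iteration)
    rng = random.Random(seed)
    n = len(with_iteration)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gain = sum(with_iteration[i] - without_iteration[i] for i in idx)
        wins += gain > 0
    return wins / n_resamples
```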
original abstract
In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it's possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the ultimate RM for reinforcement learning, resulting in the Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model's performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possess advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade school level to math competition problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the Qwen2.5-Math series (1.5B/7B/72B) whose core contribution is integrating self-improvement across the full pipeline: pre-training data generation from Qwen2-Math-Instruct, post-training iterative RM training from model samples followed by SFT data evolution and RL, and inference-time RM-guided sampling. The models are evaluated on ten English and Chinese math benchmarks spanning grade-school to competition level (GSM8K, MATH, AIME24, etc.).
Significance. If the iterative self-improvement loop demonstrably improves data quality without compounding errors, the approach would provide a scalable, largely automated route to stronger mathematical reasoning models and reduce reliance on human-curated corpora. The manuscript currently supplies only end-to-end benchmark numbers, so the practical significance cannot yet be assessed.
major comments (3)
- [Post-training phase] Post-training phase (abstract and §3): the central claim that iterative RM-SFT evolution produces net gains in data quality rests on the untested assumption that repeated sampling plus RM filtering avoids distributional shift or reward hacking. No per-iteration quality metrics, error rates on generated traces, or control runs that hold total tokens fixed while disabling iteration are reported.
- [Abstract and post-training] Abstract and post-training description: the RM is first trained on samples from Qwen2-Math-Instruct and then used to filter the next SFT round, creating an explicit circular dependency; without reported RM accuracy on held-out human data, agreement statistics, or analysis of mode collapse, it is impossible to verify that the loop improves rather than reinforces the base model's limitations.
- [Evaluation] Evaluation section: only aggregate benchmark scores are given for the final models. The absence of ablation tables isolating the iterative component, error bars across multiple seeds, or comparisons against single-round synthesis with matched data volume leaves the source of any observed gains (self-improvement vs. scale vs. base model) unidentified.
minor comments (2)
- [Abstract] The abstract states that ten datasets are used but does not enumerate them; a compact table or appendix reference would aid reproducibility.
- [Inference stage] Inference-stage RM guidance is mentioned but the precise algorithm (best-of-N, process supervision, etc.) and hyper-parameters are not specified.
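Since the report leaves the inference-time procedure underspecified, the sketch below shows best-of-N reranking, the simplest reading of "RM-guided sampling"; the sampler and reward interfaces are assumptions:

```python
from typing import Callable, List

def best_of_n(
    problem: str,
    generate: Callable[[str, int], List[str]],  # assumed sampler: (problem, n) -> candidates
    reward: Callable[[str, str], float],        # assumed RM scorer: (problem, candidate) -> score
    n: int = 8,
) -> str:
    """Sample n candidate solutions and return the one the reward model scores
    highest. Process-supervised or step-wise variants would differ, which is
    exactly the ambiguity the minor comment points at."""
    candidates = generate(problem, n)
    return max(candidates, key=lambda c: reward(problem, c))
```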
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our technical report. The points raised are important for substantiating the self-improvement claims, and we respond to each below. Revisions have been made to the manuscript to provide greater transparency on the post-training pipeline and evaluation.
point-by-point responses
-
Referee: [Post-training phase] Post-training phase (abstract and §3): the central claim that iterative RM-SFT evolution produces net gains in data quality rests on the untested assumption that repeated sampling plus RM filtering avoids distributional shift or reward hacking. No per-iteration quality metrics, error rates on generated traces, or control runs that hold total tokens fixed while disabling iteration are reported.
Authors: We agree that explicit per-iteration metrics and control experiments would provide stronger support for our claims. In the revised manuscript, we have added per-iteration quality metrics in Section 3, including the evolution of average RM scores and the percentage of samples passing the reward threshold. We also include a qualitative error analysis on generated mathematical traces. For the control runs with matched token counts, we note this as a limitation due to computational constraints and have added a discussion in the limitations section. A partial comparison to non-iterative data synthesis is provided using available resources. revision: partial
-
Referee: [Abstract and post-training] Abstract and post-training description: the RM is first trained on samples from Qwen2-Math-Instruct and then used to filter the next SFT round, creating an explicit circular dependency; without reported RM accuracy on held-out human data, agreement statistics, or analysis of mode collapse, it is impossible to verify that the loop improves rather than reinforces the base model's limitations.
Authors: The design intentionally uses iterative updates to the RM to leverage improving model capabilities and reduce the risk of reinforcing initial limitations. We have revised the post-training section to include the RM accuracy on held-out human-annotated data, along with inter-rater agreement statistics between the RM and human evaluators. Additionally, we have added an analysis of response diversity across iterations to address potential mode collapse. These revisions allow for better verification that the process leads to net improvements. revision: yes
-
Referee: [Evaluation] Evaluation section: only aggregate benchmark scores are given for the final models. The absence of ablation tables isolating the iterative component, error bars across multiple seeds, or comparisons against single-round synthesis with matched data volume leaves the source of any observed gains (self-improvement vs. scale vs. base model) unidentified.
Authors: We concur that ablations are necessary to attribute the gains. The revised evaluation section now features an ablation study isolating the iterative self-improvement component, with comparisons to single-round SFT/RL using similar data volumes. We have also included error bars representing standard deviation over multiple evaluation runs on the main benchmarks. While full multi-seed training was not feasible, these additions help identify the contributions of self-improvement. revision: partial
- Not addressed in the revision: full control experiments that hold total tokens fixed while disabling iteration, owing to high computational cost; the per-iteration diagnostics and agreement checks the responses do promise are sketched below.
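A minimal sketch of those per-iteration diagnostics: mean reward-model score, threshold pass rate, agreement with held-out human judgments, and distinct-n diversity as a crude mode-collapse check. The metric choices and the 0.5 threshold are assumptions, not definitions from the revised manuscript:

```python
from typing import Dict, List

def iteration_diagnostics(
    rm_scores: List[float],    # RM scores for this round's sampled solutions
    solutions: List[str],      # the sampled solution texts themselves
    rm_verdicts: List[bool],   # RM accept/reject on a held-out human-labelled set (assumed available)
    human_labels: List[bool],  # human correctness judgments on the same held-out set
    threshold: float = 0.5,
    n: int = 2,
) -> Dict[str, float]:
    """Round-level quality and diversity numbers of the kind the rebuttal promises."""
    mean_score = sum(rm_scores) / len(rm_scores)
    pass_rate = sum(s >= threshold for s in rm_scores) / len(rm_scores)
    agreement = sum(v == h for v, h in zip(rm_verdicts, human_labels)) / len(human_labels)

    # distinct-n: unique n-grams over total n-grams across sampled solutions;
    # a shrinking value across rounds would hint at mode collapse.
    ngrams, total = set(), 0
    for text in solutions:
        toks = text.split()
        grams = list(zip(*(toks[i:] for i in range(n))))
        ngrams.update(grams)
        total += len(grams)

    return {
        "mean_rm_score": mean_score,
        "pass_rate": pass_rate,
        "rm_human_agreement": agreement,
        f"distinct_{n}": len(ngrams) / total if total else 0.0,
    }
```

Tracked across SFT rounds, flat or falling agreement together with falling distinct-n would be the warning sign the second major comment asks about.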
Circularity Check
No significant circularity in claimed derivation chain
full rationale
The paper describes an empirical iterative pipeline (pre-training data generation from Qwen2-Math-Instruct, RM training via sampling from the same model, iterative SFT data evolution, RM updates, and final RL) but presents no equations, first-principles derivations, or uniqueness theorems whose outputs reduce to inputs by construction. The self-improvement process is a standard training loop whose net gains are claimed via end-to-end benchmarks rather than tautological redefinitions or fitted parameters renamed as predictions. Self-citations to prior Qwen models exist but are not load-bearing for the central claim, which remains an independently verifiable empirical procedure without self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Reward model trained on model-generated samples accurately ranks mathematical correctness
- ad hoc to paper: Iterative self-sampling does not introduce compounding distributional shift or reward hacking
Forward citations
Cited by 60 Pith papers
-
FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models
FormalRewardBench is the first benchmark for reward models in formal theorem proving, consisting of 250 Lean 4 preference pairs that show frontier LLMs scoring 59.8% while specialized provers score only 24.4%.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
-
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...
-
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization
DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
-
From Noise to Diversity: Random Embedding Injection in LLM Reasoning
Random Soft Prompts (RSPs) sampled from the embedding distribution improve Pass@N on reasoning benchmarks by increasing early-stage token diversity without any training.
-
Unsupervised Process Reward Models
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
-
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
PlantMarkerBench supplies 5,550 literature sentences annotated for plant marker gene evidence validity and type across Arabidopsis, maize, rice and tomato, showing frontier LLMs handle direct expression evidence but s...
-
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
PlantMarkerBench is a new multi-species benchmark with 5,550 evidence instances for evaluating language models on literature-grounded plant marker gene reasoning across expression, localization, function, indirect, an...
-
Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators
CIKA uses LLM-based interventions to probe causal effects of concepts on math reasoning success, achieving competitive results on benchmarks like Omni-MATH and GSM8K with a frozen 7B model.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
-
Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards
SGAC replaces reward-variance heuristics with a multi-feature learnable selector emphasizing output entropy, yielding 68% accuracy on Hendrycks MATH with Qwen2.5-Math-1.5B versus 64-66% baselines.
-
E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems
E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-bas...
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves small LLMs by using relative local validity of next steps to dynamically adjust imitation strength instead of enforcing full trajectory matching.
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.
-
PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
-
Teacher-Guided Policy Optimization for LLM Distillation
TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
-
Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning
For a fixed data budget in LLM supervised fine-tuning, optimal data difficulty shifts toward harder examples as the budget grows because of the tradeoff between in-distribution generalization gap and extrapolation gap.
-
Scalable Token-Level Hallucination Detection in Large Language Models
TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...
-
H\"older Policy Optimisation
HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
-
Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.
-
Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.
-
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
-
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
-
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.
-
Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning
A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.
-
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
-
DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
DGPO is a critic-free RL framework that uses bounded Hellinger distance and entropy-gated advantage redistribution to enable fine-grained token-level credit assignment in long CoT generations for LLM alignment, report...
-
DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
DGPO reinterprets distribution deviation as a guiding signal in a critic-free policy optimization framework to enable fine-grained credit assignment for LLM chain-of-thought reasoning.
-
Controllable and Verifiable Process Data Synthesis for Process Reward Models
A controllable synthesis method creates prefix-invalid yet trajectory-consistent process supervision data for training and evaluating process reward models by injecting verifiable errors into symbolic reasoning chains.
-
Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models
Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.
-
Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors
DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
Cost-Aware Learning
Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.
-
Select to Think: Unlocking SLM Potential with Local Sufficiency
Small language models can achieve near large-model reasoning performance by learning to re-rank their own top-K token predictions after distilling selection from the large model.
-
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
-
SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.
-
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.
-
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
-
Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
DCM-Agent improves LLM performance on multi-paradigm optimization problems by 11-21% via dual-cluster memory construction and dynamic inference guidance.
-
PARM: Pipeline-Adapted Reward Model
PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.
-
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...
-
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
-
Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
-
Calibration-Aware Policy Optimization for Reasoning LLMs
CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.
-
Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable...
-
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.
-
Visual-RFT: Visual Reinforcement Fine-Tuning
Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
-
Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
-
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...
-
Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.
-
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
-
Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing
NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.
-
Diversity in Large Language Models under Supervised Fine-Tuning
Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
-
Exploring the Limits of Pruning: Task-Specific Neurons, Model Collapse, and Recovery in Task-Specific Large Language Models
Selective pruning of low-activation neurons in task-specific LLMs preserves accuracy better than random pruning, but removing roughly 10% of highly selective neurons triggers total collapse, with fine-tuning recoverin...