WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

Can Xu; Chongyang Tao; Dongmei Zhang; Haipeng Luo; Jianguang Lou; Pu Zhao; Qingfeng Sun; Qingwei Lin; Shifeng Chen; Xiubo Geng

arxiv: 2308.09583 · v3 · pith:L6ZPUUAInew · submitted 2023-08-18 · 💻 cs.CL · cs.AI · cs.LG

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, Dongmei Zhang

Pith reviewed 2026-05-17 03:56 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords mathematical reasoning · large language models · reinforcement learning · instruction evolution · chain of thought · GSM8K · MATH benchmark

The pith

WizardMath applies reinforced evol-instruct feedback to boost LLMs' math chain-of-thought reasoning without external tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WizardMath as a way to strengthen mathematical reasoning in large language models by using Reinforcement Learning from Evol-Instruct Feedback, or RLEIF. This process evolves instructions and applies process supervision directly in the math domain. A sympathetic reader would care because the resulting models, especially the 70B version, reach or exceed the performance of closed models such as GPT-3.5-Turbo, Claude 2, Gemini Pro, and early GPT-4 on GSM8K and MATH benchmarks. The work also points to instruction evolution and process supervision as central to these gains and shows strong results even with the smaller Mistral 7B base.

Core claim

By applying Reinforcement Learning from Evol-Instruct Feedback (RLEIF) to the math domain, WizardMath enhances the mathematical CoT reasoning abilities of LLMs without using external Python tools. The resulting WizardMath-Mistral 7B surpasses top-tier open-source LLMs, and WizardMath 70B outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro and GPT-4-early-version on GSM8K and MATH.

What carries the argument

Reinforcement Learning from Evol-Instruct Feedback (RLEIF), which evolves math instructions iteratively and reinforces correct reasoning steps through feedback.
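To make the idea of reinforcing reasoning steps concrete, here is a minimal sketch of how step-level feedback could be folded into a single scalar reward for a policy update. It is an illustration under stated assumptions, not the paper's formula: the function name `solution_reward`, the two aggregation choices (`min` and `product`), and the multiplication by an instruction-quality score are all hypothetical. The only things taken from the paper are that one reward model scores instruction quality and another scores individual solution steps.

```python
from math import prod


def solution_reward(step_scores, instruction_score, aggregate="min"):
    """Combine per-step scores and an instruction score into one reward.

    ASSUMPTION: this aggregation rule is illustrative only. The reviewed
    text does not state how the paper combines its reward signals.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    if aggregate == "min":
        # A chain of reasoning is only as strong as its weakest step.
        process_score = min(step_scores)
    elif aggregate == "product":
        # Treat step scores as independent probabilities of correctness.
        process_score = prod(step_scores)
    else:
        raise ValueError(f"unknown aggregate: {aggregate!r}")
    return instruction_score * process_score


# Hypothetical scores for a three-step solution to a well-posed problem.
reward = solution_reward([0.9, 0.5, 0.8], instruction_score=1.0)
```

With `min` aggregation, one weak step caps the reward for the whole solution, which is the property that distinguishes process supervision from scoring only the final answer.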

Load-bearing premise

The reported performance gains stem primarily from the RLEIF procedure rather than from differences in the base model, data mixture, or evaluation protocol.

What would settle it

A controlled experiment that fine-tunes the identical base models on the same evolved instructions but omits the reinforcement learning feedback loop and checks whether the large accuracy lifts on GSM8K and MATH still appear.
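One way such a controlled comparison could be scored is a paired bootstrap over per-problem correctness, since both models answer the same test problems. The sketch below is an assumption about how a reader might run the check, not something the paper reports. The function name `paired_bootstrap_gain` and the 0/1 correctness lists are hypothetical.

```python
import random


def paired_bootstrap_gain(correct_a, correct_b, n_resamples=2000, seed=0):
    """Estimate the accuracy gain of model B over model A with a 95% interval.

    correct_a and correct_b are 0/1 lists aligned on the same test problems,
    e.g. A = SFT-only baseline and B = the same model after the RL stage.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same problems")
    n = len(correct_a)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_b[i] - correct_a[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    observed = (sum(correct_b) - sum(correct_a)) / n
    return observed, lower, upper


# Toy data only: B solves every problem, A solves every other one.
gain, lower, upper = paired_bootstrap_gain([1, 0] * 50, [1, 1] * 50)
```

If the lower bound of the interval stays above zero, the gain is unlikely to be an artifact of which problems happened to be in the test set. The test says nothing about whether the two training runs were otherwise matched.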

Original abstract

Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data and without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical CoT reasoning abilities of LLMs without using external python tools, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses top-tier open-source LLMs by a substantial margin with higher data efficiency. Furthermore, WizardMath 70B even outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro and GPT-4-early-version. Additionally, our preliminary exploration highlights the pivotal role of instruction evolution and process supervision in achieving exceptional math performance. For more details refer to https://github.com/nlpxucan/WizardLM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WizardMath shows open models can hit strong math numbers with evolved data plus RL process feedback, but the RL step still needs isolation from the data changes.

The main thing here is that WizardMath layers reinforcement learning from process feedback on top of Evol-Instruct data to lift math reasoning in open LLMs. The 7B Mistral version beats other open models on GSM8k and MATH with better data efficiency, and the 70B version claims to surpass GPT-3.5-Turbo, Claude 2, Gemini Pro, and an early GPT-4 on those benchmarks without external tools. They also flag the importance of instruction evolution and process supervision in a short exploration section. The GitHub release helps with checking the pipeline. This is a straightforward extension of prior work on data generation and RL for reasoning, and the benchmark margins are the concrete advance.

The experimental gaps are the real issue. No direct comparison appears between the full RLEIF model and an SFT-only run on the same evolved dataset, so it is hard to tell whether the RL step drives the gains or if the data quality alone would have done most of the work. The abstract also omits error bars, prompt formatting details, and data filtering rules, which leaves the headline results resting on unreviewed setup choices.

This paper is for groups working on practical math reasoning improvements in open models, especially those who want to try similar pipelines for tutoring or assistant tools. Readers who care about clean attribution of gains will want the missing ablations before treating the method as settled. It is worth sending to peer review because the performance claims are sharp enough that referees can test the training details and run controls themselves.

Referee Report

2 major / 2 minor

Summary. The paper introduces WizardMath, which applies Reinforcement Learning from Evol-Instruct Feedback (RLEIF) to boost chain-of-thought mathematical reasoning in LLMs without external tools. It reports that WizardMath-Mistral-7B substantially outperforms leading open-source models on GSM8k and MATH, while WizardMath-70B surpasses GPT-3.5-Turbo, Claude 2, Gemini Pro, and an early GPT-4 variant; a preliminary analysis emphasizes the roles of instruction evolution and process supervision.

Significance. If the gains are shown to stem specifically from RLEIF rather than data quality or base-model differences, the work would provide a practical recipe for elevating open-source mathematical reasoning to near-proprietary levels using only evolved instructions and process-level RL, with the preliminary ablation-style exploration of evolution and supervision serving as a useful starting point for follow-on research.

major comments (2)

[Experiments] The manuscript provides no direct SFT-only baseline trained on the identical Evol-Instruct dataset before applying the RL stage. Without this comparison, it remains unclear whether the headline gains on GSM8k and MATH are driven by the RLEIF reinforcement step or simply by the quality of the evolved data.
[Abstract and §4] Benchmark results in the abstract and main results section are presented without error bars, details on prompt formatting, data exclusion criteria, or full training curves. These omissions make it difficult to assess the statistical reliability of the claim that WizardMath-70B outperforms the listed closed models.
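As a rough guide to what the requested error bars would look like, a benchmark accuracy can be treated as a binomial proportion and given a Wilson score interval. The sketch below is an illustration, not part of the paper's evaluation. The helper name `wilson_interval` and the example count of correct answers are hypothetical. The GSM8K test split has 1,319 problems.

```python
import math


def wilson_interval(num_correct, num_total, z=1.96):
    """Wilson score interval for an accuracy; z=1.96 gives about 95% coverage."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    p = num_correct / num_total
    denom = 1 + z * z / num_total
    center = (p + z * z / (2 * num_total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / num_total + z * z / (4 * num_total**2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical result: 1,055 of the 1,319 GSM8K test problems answered correctly.
low, high = wilson_interval(1055, 1319)
```

At an accuracy near 80% on a test set of this size, the interval spans roughly two percentage points on either side. Differences between models smaller than that are hard to distinguish from sampling noise on a single evaluation run.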

minor comments (2)

[Conclusion] The GitHub link is referenced but the paper would benefit from an explicit statement of which artifacts (code, data splits, evaluation prompts) are released.
[Method] Notation for the process-supervision reward model could be introduced earlier and used consistently in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where feasible.

Referee: [Experiments] The manuscript provides no direct SFT-only baseline trained on the identical Evol-Instruct dataset before applying the RL stage. Without this comparison, it remains unclear whether the headline gains on GSM8k and MATH are driven by the RLEIF reinforcement step or simply by the quality of the evolved data.

Authors: We agree that an explicit SFT-only baseline on the identical dataset would help isolate the contribution of the RLEIF stage. In the revised manuscript we will add results from such a baseline trained on the same Evol-Instruct data, allowing direct comparison of performance before and after the reinforcement learning phase.
revision: yes

Referee: [Abstract and §4] Benchmark results in the abstract and main results section are presented without error bars, details on prompt formatting, data exclusion criteria, or full training curves. These omissions make it difficult to assess the statistical reliability of the claim that WizardMath-70B outperforms the listed closed models.

Authors: We acknowledge the value of these details for assessing reliability. The revised version will include error bars from repeated evaluations where computationally feasible, explicit prompt formatting descriptions, data exclusion criteria, and full training curves in the appendix to support the reported comparisons.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks without self-referential derivations

The paper introduces the RLEIF procedure and reports accuracy numbers on GSM8k and MATH, comparing WizardMath variants against GPT-3.5-Turbo, Claude 2, Gemini Pro and early GPT-4. No equations appear that define a quantity in terms of itself, no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is imported via self-citation to force the central result. The derivation chain consists of standard RL training steps whose outputs are evaluated on independent test sets; therefore the headline performance numbers are not equivalent to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes that process-level feedback signals are reliable and that benchmark scores reflect genuine reasoning gains.

Forward citations

Cited by 38 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

FISolver trains a compact LLM on backward-generated (differential equation, first integral) pairs and uses guided reinforcement learning to outperform larger models and Mathematica on first-integral benchmarks at lower cost.
Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs
cs.LG 2026-05 unverdicted novelty 7.0

Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories
cs.AI 2026-04 unverdicted novelty 7.0

CRPS synthesizes reasoning paths by contrasting high- and low-quality MCTS trajectories, enabling models trained on 60K examples to match or exceed those trained on 590K standard examples with better out-of-domain gen...
CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning
cs.AI 2025-12 unverdicted novelty 7.0

CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.
Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models
cs.SE 2025-10 unverdicted novelty 7.0

LLMs achieve 81% coherent execution simulation on HumanEval but show mostly random or weak consistency across tests, with frontier models relying on natural language shortcuts instead of true program analysis.
SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From
cs.CR 2025-09 unverdicted novelty 7.0

SeedPrints fingerprints LLMs using persistent biases from initialization seeds for lineage verification across pretraining and adaptation stages.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
cs.CV 2024-03 conditional novelty 7.0

MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
CodeMind: Evaluating Large Language Models for Code Reasoning
cs.SE 2024-02 unverdicted novelty 7.0

CodeMind evaluates ten LLMs on four benchmarks using three new code reasoning tasks, finding performance varies by model size and drops with complexity while showing no correlation with bug repair ability.
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
DEL: Digit Entropy Loss for Numerical Learning of Large Language Models
cs.CL 2026-05 conditional novelty 6.0

DEL is a new loss for LLM numerical learning that applies supervised digit entropy optimization and extends to floating-point numbers, showing improved accuracy and distance metrics over prior methods on math benchmarks.
Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
cs.SE 2026-05 unverdicted novelty 6.0

FireFly inverts task synthesis by exploring real MCP servers first via pairwise tool graphs and sub-DAG sampling, then generates 5,144 verified tasks backward from outcomes to train a 4B model that matches Claude Sonn...
Distribution Corrected Offline Data Distillation for Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference
cs.CV 2026-05 unverdicted novelty 6.0

CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 6.0

RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.
Segment-Aligned Policy Optimization for Multi-Modal Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications
cs.AI 2025-11 unverdicted novelty 6.0

MM-Telco creates multimodal benchmarks for telecom and demonstrates that fine-tuned LLMs and VLMs achieve significant performance gains on domain-specific tasks.
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
cs.CL 2025-08 unverdicted novelty 6.0

Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
cs.AI 2025-07 unverdicted novelty 6.0

League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
cs.CL 2024-12 unverdicted novelty 6.0

HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
cs.LG 2024-06 conditional novelty 6.0

Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
cs.CL 2024-02 unverdicted novelty 6.0

DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
cs.AI 2023-12 conditional novelty 6.0

Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
Llemma: An Open Language Model For Mathematics
cs.CL 2023-10 unverdicted novelty 6.0

Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
cs.CL 2023-09 conditional novelty 6.0

ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on...
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
cs.CL 2023-09 conditional novelty 6.0

Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
cs.CL 2023-09 conditional novelty 6.0

MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

A neuro-symbolic engine generates GeoSym127K, a 127K-question dataset with symbolic ground truths and verified CoT pairs, yielding +22.21% gains on MathVerse Vision-Only after SFT on Qwen3-VL-8B.
ARMove: Learning to Predict Human Mobility through Agentic Reasoning
cs.MA 2026-04 unverdicted novelty 5.0

ARMove is a transferable framework for human mobility prediction that combines agentic LLM reasoning, feature management, and large-small model synergy to outperform baselines on several metrics while improving interp...
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
cs.CV 2025-02 unverdicted novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
A Survey on LLM-as-a-Judge
cs.CL 2024-11 unverdicted novelty 4.0

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
cs.CL 2024-01 unverdicted novelty 4.0

DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
cs.AI 2025-01 unverdicted novelty 3.0

The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
A Survey on Knowledge Distillation of Large Language Models
cs.CL 2024-02 accept novelty 3.0

A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
cs.CL 2025-02 unverdicted novelty 2.0

Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 36 Pith papers · 1 internal anchor

[1]

URL https://api.semanticscholar.org/CorpusID:266818336. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Wint...

[2]

doi: 10.18653/v1/n19-1421

Association for Computational Linguistics. doi: 10.18653/v1/N19-1421. URL https://aclanthology.org/N19-1421. 14 Published as a conference paper at ICLR 2025 Zhengyang Tang, Xingxing Zhang, Benyou Wang, and Furu Wei. Mathscale: Scaling instruction tuning for mathematical reasoning. arXiv preprint arXiv:2403.02884, 2024. Rohan Taori, Ishaan Gulrajani, Tiany...

[3]

Instruction Evolution and SFT In the first step, we apply upward and downward instruction evolution on the GSM8k and MATH datasets, generating evolved instructions for the SFT. On the leftmost side of Figure 1, the three blue arrows, from top to bottom, represent: (a) the adoption of the instruction evolution technique, (b) the generation of evolved instr...

[4]

“A” represents the original instruction, while “B,

Reward Model Training The second step involves two reward models: the Instruction Quality Scoring Reward Model (IRM) and the Process-Supervised Reward Model (PRM), depicted in the central section of Figure 1. • IRM: We employ upward and downward evolution on a seed instruction, yielding five instructions (original + evolved). These instructions are ranked...

[5]

As depicted in the far-right section of Figure 1, the process is as follows: (a) The first blue arrow represents instruction scoring by the IRM

Reinforcement Learning with PPO In the final step, we integrate the IRM and PRM within a Proximal Policy Optimization (PPO)-based reinforcement learning framework. As depicted in the far-right section of Figure 1, the process is as follows: (a) The first blue arrow represents instruction scoring by the IRM. (b) The second blue arrow shows PPO initializati...

[6]

• On Llama-2-13B and Llama-2-70B, WizardMath-SFT achieves comparable performance to Xwin-Math

Performance Comparison: • On Llama-2-7B and Mistral-7B-v0.1, WizardMath-SFT performs marginally below SOTA models (i.e., Xwin-Math and Skywork-Math) and outperforms existing other excellent models (i.e., DART-Math). • On Llama-2-13B and Llama-2-70B, WizardMath-SFT achieves comparable performance to Xwin-Math. • On all various base models, WizardMath-SFT s...

[7]

Meanwhile, WizardMath-SFT demonstrates comparable or superior performance to advanced data synthesis methods, such as DART-Math and MetaMath, across all base models

Comparison with advanced data synthesis methods (i.e., DART-Math, MetaMath) As shown in the following Table 15, DART-Math demonstrates strong performance across various base models and the data synthesis method proposed by DART-Math shows the effectiveness and outstanding performance. Meanwhile, WizardMath-SFT demonstrates comparable or superior performan...

[8]

It also significantly enhances the mathematical reasoning capabilities of our models

The proposed Math Evol Instruct data synthesis method is also as effective and practical as the current state-of-the-art data synthesis methods, such as DART-Math, Skywork-Math and Xwin-Math in the SFT stage. It also significantly enhances the mathematical reasoning capabilities of our models

[9]

The proposed IRM and PRM models substantially improve performance during the reinforcement learning phase. They not only continuously enhance the mathematical reasoning abilities of our 34 Published as a conference paper at ICLR 2025 Table 18: The performance comparison of WizardMath-SFT with DART-Math, Xwin-Math, and Skywork-Math on the Llama2-7B base mo...

[10]

In Table 6, we provide a detailed analysis of the effects of downward evolution

Unlike WizardLM/WizardCoder, which primarily focus on increasing instruction difficulty, we are the first to propose the novel concept of downward evolution, a major distinction in instruction evolution. In Table 6, we provide a detailed analysis of the effects of downward evolution. Specifically, two rounds of downward evolution led to a remarkable impro...

[11]

In reinforcement learning (RL) training, we firstly propose the instruction quality scoring reward model (IRM) combined with the process supervision reward model (PRM) further enhancing WizardMath mathematical reasoning ability. As demonstrated in Table 3, our method achieves a remarkable 5%–8% improvement in GSM8k and MATH performance over the SFT backbo...

[12]

Additionally, the training datasets for SFT, PRM, and IRM are fully synthesized using AI systems

We firstly propose to use AI to annotate the step-level PRM training data. Additionally, the training datasets for SFT, PRM, and IRM are fully synthesized using AI systems. This fully AI-automated data generation pipeline ensures scalability

[13]

It surpasses all existing open-source state-of-the-art models, showcasing the effectiveness and robustness of the RLEIF approach proposed in our study

WizardMath demonstrates outstanding performance across a wide range of model scales, from 100M to 1B and 70B parameters, on the benchmarks such as GSM8k, MATH, and out-of-distribution (OOD) tasks like MWPBench (Tang et al., 2024). It surpasses all existing open-source state-of-the-art models, showcasing the effectiveness and robustness of the RLEIF approa...
